GSA / datagov-wptheme

Data.gov WordPress Theme (obsolete)
https://www.data.gov

Dataset Type Should Include "Non-Geospatial" #57

Closed. Raseman closed this issue 10 years ago.

Raseman commented 10 years ago

In the left-hand navigation bar in the data catalog, the only option under Dataset Type is "Geospatial." It would be very useful to have "Non-Geospatial" as well.

JeanneHolm commented 10 years ago

This is on the current catalog.data.gov left-hand navigation.

kvuppala commented 10 years ago

There was a reason we didn't want to include "Non-Geospatial" in the facet; this needs discussion.

philipashlock commented 10 years ago

I don't see any reason not to provide this as an option. At the very least we can document it on the advanced search page. Currently you can use this negation operator to filter out geospatial data: `-metadata_type:geospatial`

http://catalog.data.gov/dataset?q=-metadata_type:geospatial
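For anyone scripting against the catalog, here is a minimal sketch of the same negation filter sent through CKAN's `package_search` action (this assumes catalog.data.gov still exposes the standard CKAN action API at this path):

```python
import json
import urllib.parse
import urllib.request

# Same negation query as the URL above, but via the CKAN search API.
BASE = "https://catalog.data.gov/api/3/action/package_search"
params = urllib.parse.urlencode({"q": "-metadata_type:geospatial", "rows": 10})

with urllib.request.urlopen(f"{BASE}?{params}") as resp:
    result = json.load(resp)["result"]

print(result["count"], "non-geospatial datasets found")
for pkg in result["results"]:
    print("-", pkg["title"])
```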

Raseman commented 10 years ago

This one's still open - I think it would really help to be able to sort through just the non-geospatial data sets for many users.

cew821 commented 10 years ago

+1. I would argue non-geospatial should be the default, and getting geospatial results should be something you have to choose to do (just like "non-images" are the default in a Google search). Do we have any analytics on search behavior? If so, I could crunch through them to see whether that proposal makes sense.

JeanneHolm commented 10 years ago

I don't think we should pre-filter the default search. Some of our most popular datasets are geospatial (like Earthquake data) and the vast majority of the data is classified as geospatial. User testing clearly showed that people start searching on Data.gov broadly and then want to narrow down, rather than start with a narrow pre-filtered set and figure out how to expand.

I agree that if we have geospatial as a discriminator, we should also include non-geospatial.

nsinai commented 10 years ago

I'd agree with Charles -- my instinct would be to show non-geospatial data as the default, as geospatial data overwhelms the rest of the data (in both search and browse). But I'd defer to an analysis of search behavior data -- can we get Charles that data? @JeanneHolm, how did the user testing show that users prefer searching all the data -- did we A/B test it?

I'd also suggest that the default ordering (before a user searches) should be popularity, not relevance. This would make it more inviting to browse.

JeanneHolm commented 10 years ago

Users specifically stated that they wanted to narrow their search from the whole corpus of data down to what they needed. They were frustrated when they had to expand their search after already narrowing it.

During usability testing, people found the following useful in searching:

- Topics (previously communities) that had relevant and curated lists of datasets ("Community sites and what the staff is thinking help me narrow my search" and "I see the tag safety...I don't feel lost anymore")
- Filters in the catalog (many people used these to narrow their search)
- Flags on the datasets indicating federal, university, state, etc. ("I like using color to say that its state or federal and that you have local data too")
- Navigation items that allow people to browse in addition to searching ("You want me to search, but I don’t search")
- Using relevance as the ranking for search results ("Search worked really well with the sorting. I did a search and didn’t get a bunch of unrelated stuff.")

People also stated that the following would be useful:

- Advanced search page (open as #62)
- Tutorial on search and description of catalog (currently at http://next.data.gov/newcataloginfo)

Using popularity as the discriminator can be tricky and doesn't scale well with the new method of harvesting from the agency data.json files. As we update the catalog with the content from those files, tracking popularity is problematic. How we define popularity is also a concern: number of downloads (only from Data.gov, or also from the agency site?), quality rating (do we reintroduce this feature?), or number of uses of that data (which requires self-reporting). Until we have some history gathering popularity over time for the harvested content, relevance is the more reliable method. In addition, users clearly expected to see the results that were most relevant to their query, and wanted to be able to construct complex queries to dive quickly to that relevant data.

cew821 commented 10 years ago

I agree with all of Jeanne's points above about ways to make navigation easier. I think the only thing under consideration here is the default settings of the search filter.

The best way to measure search effectiveness is by the relevancy of the search results returned for a given query. Perhaps after we launch, we can run a study focused on that topic. As I understand it, search relevancy is most often measured through user testing, where users conduct a search and rate the relevancy of each result based on their intuition (e.g., on a scale from 0 = not relevant up to 2 = highly relevant). The rank order plus the relevancy scores for each result are then combined using a formula worked out by researchers to come up with a single "score" for the search function. This provides a score that can be tracked over time, and one that can be used to A/B test different search algorithm/filter options.
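The comment doesn't name a specific formula, but one common choice for combining rank order with per-result ratings is discounted cumulative gain (DCG), normalized against the ideal ordering (nDCG). A minimal sketch:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevant results near the top count more."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalize against the best possible ordering so the score falls in 0..1."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# Example: a tester rates the top five results for one query (0 = not relevant, 2 = highly relevant).
print(round(ndcg([2, 0, 1, 2, 0]), 3))  # one comparable score for this query
```

Averaging this score across a fixed set of test queries gives the single number that can be tracked over time or A/B tested.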

I've also heard that a metric people use to track search effectiveness is the "click-through rate" from search results: the percentage of users who click on one of the results after conducting a search. If this is low, you can infer they aren't seeing relevant results for their search terms.
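A minimal sketch of that metric, assuming a hypothetical log with one record per search and a flag for whether the user clicked any result:

```python
# Hypothetical search log; real data would come from analytics exports.
searches = [
    {"query": "earthquake", "clicked_result": True},
    {"query": "hospital costs", "clicked_result": False},
    {"query": "crime statistics", "clicked_result": True},
]

click_through_rate = sum(s["clicked_result"] for s in searches) / len(searches)
print(f"Search click-through rate: {click_through_rate:.0%}")
```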

Google Analytics has a number of built-in site search analysis tools (https://support.google.com/analytics/answer/1032321?hl=en) that would be a good place to start.

References:

http://stackoverflow.com/questions/7142264/solr-relevancy-how-to-a-b-test-for-search-quality/7173137#7173137

http://www.kaushik.net/avinash/kick-butt-with-internal-site-search-analytics/


JeanneHolm commented 10 years ago

@nsinai has a good point on the default when no query is entered. We can assume someone is just perusing the data, so "popularity" may be best (given all the caveats already mentioned about this being a bit in flux with JSON harvesting for the next month or so). A second option is "Last modified". Open to either.
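For illustration, a minimal sketch of what those two empty-query defaults could look like against a CKAN catalog's `package_search` action. The `views_recent` sort assumes CKAN's page-view tracking is enabled; `metadata_modified` covers the "Last modified" option:

```python
import json
import urllib.parse
import urllib.request

BASE = "https://catalog.data.gov/api/3/action/package_search"

def browse(sort_expr, rows=5):
    """Browse the catalog with no query, ordered by a Solr sort expression."""
    params = urllib.parse.urlencode({"q": "", "sort": sort_expr, "rows": rows})
    with urllib.request.urlopen(f"{BASE}?{params}") as resp:
        return [pkg["title"] for pkg in json.load(resp)["result"]["results"]]

# "Popularity" default -- only meaningful if CKAN tracking is turned on.
print(browse("views_recent desc"))
# "Last modified" alternative.
print(browse("metadata_modified desc"))
```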

@cew821 We do have Google Analytics running on the site and could use that for the click-through rate option, as well as other ideas. The relevancy right now is based on the keywords or other advanced search attributes entered. Again, once we replace the catalog contents with the harvest from the JSON files, we may need to true this up depending on how well the previous URLs map to the new ones (since that is how Google Analytics tracks traffic on each item).

dialsunny commented 10 years ago

The Dataset Type filter now shows both Geospatial and Non-Geospatial options.