GSA / datagov-wptheme

Data.gov WordPress Theme (obsolete)
https://www.data.gov

make datasets WITHIN Collections searchable from the homepage #584

Open rebeccawilliams opened 9 years ago

rebeccawilliams commented 9 years ago

Currently you can search for Collection information, but not the datasets within until you have entered the Collection. This inadvertently hides a lot of relevant data for users!

philipashlock commented 9 years ago

If you remember what data.gov was like before similar datasets were grouped into collections, searching was actually much more difficult. I think the datasets inside collections were intentionally left out of top-level search results to make searching and browsing easier, so I think this behavior is more of a feature than a bug, but I'm sure there are ways to improve upon it.

However, just to give you a sense of how much harder it is to browse through search results that include collections (where many datasets in the collection satisfy the search terms) consider this: catalog.data.gov has 139,169 datasets and of those only 245 are collections, but those 245 collections include 1,020,286 datasets - often with nearly identical metadata. Typically, it actually becomes much more difficult to search and browse when your search terms satisfy tens of thousands of nearly identical items.

Probably the best way to handle this is for the search query to be able to include datasets within the collections but for the search results to only display the collection, not the individual datasets - or perhaps the collection also shows the top result from within the collection that matched the search, but otherwise you would need to further refine the search within the collection to see more than the top result.
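As a rough sketch of that idea (not how catalog.data.gov actually works), a flat list of hits could be collapsed so that each collection shows up once, represented by its best-scoring member. The hit dictionaries and the `id` and `score` keys are assumptions made for illustration; `collection_package_id` is the field name mentioned later in this thread.

```python
def collapse_by_collection(hits):
    """Keep only the best-scoring hit per collection.

    Each hit is assumed to be a dict with hypothetical keys "id" and
    "score", plus an optional "collection_package_id" for datasets that
    belong to a collection. Standalone datasets are kept as-is.
    """
    best = {}
    for hit in hits:
        # Datasets outside any collection are grouped under their own id.
        key = hit.get("collection_package_id") or hit["id"]
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit
    # Highest-scoring representatives first.
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)
```

Seeing more than the top result would then mean repeating the search inside the chosen collection.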

Or maybe the search results could always detect when they include lots of similar items and attempt to automatically filter those out unless the user selects an option to show them all (Google often does this).

kvuppala commented 9 years ago

@philipashlock @rebeccawilliams I like the idea of searching through the records within a collection but displaying only the collection record to start with. When the collection is selected, the matching search results should be displayed, with an option to show all.

amilan17 commented 9 years ago

I second Kishore's suggestion (if I understand it). As a provider of many homogeneous-like records, I don't want the searchability of these records to be suppressed until the user finds the collection-level metadata record, and this solution could support this use case without overcluttering the search results.

alanswx commented 9 years ago

Right now the code that implements the WAF Collection creates a parent node, and then for each of the children it sets a variable collection_package_id. It then filters those out of every search, using an "fq" (filter query) on collection_package_id in the geodatagov plugin.py code.

It works in two ways: either you are searching within a collection and it adds collection_package_id: or it puts in -collection_package_id:["" TO *]
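A minimal sketch of those two cases, assuming the filter is built as a plain string. The function name and the quoting of the collection id are hypothetical and not copied from plugin.py.

```python
from typing import Optional


def collection_filter_query(collection_id: Optional[str] = None) -> str:
    """Return a Solr fq clause for the two cases described above.

    This is an illustrative helper, not the geodatagov implementation.
    """
    if collection_id:
        # Searching inside one collection: keep only its child datasets.
        return f'collection_package_id:"{collection_id}"'
    # Top-level search: exclude every record that has any value in
    # collection_package_id, i.e. every child of a collection.
    return '-collection_package_id:["" TO *]'
```

With no argument the helper returns the exclusion clause quoted above; with an id it restricts the search to that one collection.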

Census uses the collections, but there is a problem. They have a Collection of one theme, like:
"TIGER/Line Shapefile, 2014, state, California, Current Unified School Districts State-based Shapefile"
"TIGER/Line Shapefile, 2014, state, Alaska, Current Unified School Districts State-based Shapefile"

If you search for those phrases in quotes, you won't find the collection. Also, if you were to search for +Tiger +2014 +California +School +Districts, it won't find the record.

I suggest using Solr Field Collapsing instead of a post-filter: https://wiki.apache.org/solr/FieldCollapsing

This will group the entries together, and then you will find a "collection" (group in Field Collapsing terminology) and it will cluster it together as one thing. This allows you to make more complicated collections of things in addition to the simple Census example, because it won't hide anything.

Then we can decide to just show the first item of the group, or we could even show the first item and then indent and show n items as a cluster, which could be pretty nice.
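The "first item, then n indented items" display could look roughly like the sketch below. It assumes a response shaped like Solr's default grouped JSON output (`grouped` > field name > `groups` > `doclist`) and a `title` field on each document; the function name and the wording of the summary line are made up for illustration.

```python
def render_groups(response, field="collection_package_id", n=2):
    """Turn a grouped response into display lines.

    The first document of each group is shown flush left, up to n more
    are indented beneath it, and any remainder is summarized.
    """
    lines = []
    for group in response["grouped"][field]["groups"]:
        docs = group["doclist"]["docs"]
        if not docs:
            continue
        shown = docs[:1 + n]
        lines.append(shown[0]["title"])
        for doc in shown[1:]:
            lines.append("    " + doc["title"])
        # numFound counts every match in the group, not just returned docs.
        remaining = group["doclist"]["numFound"] - len(shown)
        if remaining > 0:
            lines.append(f"    ({remaining} more in this collection)")
    return lines
```

Showing only the first item of each group is the same call with `n=0`.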

/solr/collection1/select?q=TIGER/Line%20Shapefile,%202014,%20state,%20California,%20Current%20Unified%20School%20Districts%20State-based%20Shapefile&wt=xml&indent=true&group=true&group.field=collection_package_id
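For reference, a query like the one above could be assembled programmatically along these lines. This is only a sketch: the helper name is hypothetical, the base path is taken from the example above, and `wt=json` and `group.limit` (documents returned per group) are choices made here rather than settings confirmed for catalog.data.gov.

```python
from urllib.parse import quote, urlencode


def grouped_select_url(query, base="/solr/collection1/select", group_limit=1):
    """Build a Solr select URL that groups results by collection."""
    params = {
        "q": query,
        "wt": "json",
        "group": "true",
        "group.field": "collection_package_id",
        "group.limit": group_limit,
    }
    # quote_via=quote encodes spaces as %20 instead of "+".
    return base + "?" + urlencode(params, quote_via=quote)
```

Raising `group_limit` would return several datasets per collection, enough to fill an indented cluster under the first hit.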