Tag explorer feature - GitHub Issues

rabdill commented 1 month ago

User story(ish): As a website visitor, I can search sample attributes to determine whether the compendium contains data relevant to my research question.

I can search the metadata fields available in the dataset to identify relevant tag names (e.g. host_age/age/subject_years)
I can search sample-level values for the relevant tags to determine which projects have samples with relevant information.
I can export a list of projects to use in filtering the full dataset.

--

Based on an email exchange with a user looking for some host phenotype information, my general idea is for two panels, with the first filtering the rows in the second. None of this nomenclature is very precise, but I'll use "key" and "value" for the sample attributes we currently pull from NCBI.

Panel A

Each row lists a key and how broadly it's available. [Screenshot: panela]

This is currently 2608 rows in a 63 KB text file.
If users could use a textbox to filter the rows, and toggle row selections on and off, those selected tags would be the ones displayed in Panel B.
Sorting by total samples seems like a reasonable default, but enabling sorting by key might help searching.
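The Panel A behavior described above (text filter, toggled selections, sort by samples or by key) could be prototyped roughly like this. It is only a sketch in Python to pin down the logic, not the site's implementation: the header row, the column names `tag`/`samples`/`projects`, the toy counts, and the helper names `parse_key_counts`, `filter_and_sort`, and `toggle` are all assumptions made for illustration.

```python
import csv
import io
from typing import NamedTuple


class KeyRow(NamedTuple):
    tag: str
    samples: int
    projects: int


# Made-up rows in the assumed layout of the Panel A file (header row assumed).
TOY_KEY_COUNTS = (
    "tag\tsamples\tprojects\n"
    "collection_date\t900\t30\n"
    "age\t300\t9\n"
    "host_age\t120\t4\n"
    "subject_years\t15\t1\n"
)


def parse_key_counts(tsv_text):
    """Parse tab-separated tag/samples/projects rows into KeyRow tuples."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [KeyRow(r["tag"], int(r["samples"]), int(r["projects"])) for r in reader]


def filter_and_sort(rows, query="", sort_by="samples"):
    """Keep rows whose tag contains the query, then sort them for display."""
    needle = query.strip().lower()
    matches = [r for r in rows if needle in r.tag.lower()]
    if sort_by == "tag":
        return sorted(matches, key=lambda r: r.tag.lower())
    # Default: most samples first, ties broken by projects, then tag name.
    return sorted(matches, key=lambda r: (-r.samples, -r.projects, r.tag))


def toggle(selected, tag):
    """Return a new selection with the tag added if absent, removed if present."""
    return selected ^ {tag}


rows = parse_key_counts(TOY_KEY_COUNTS)
visible = filter_and_sort(rows, "age")      # matches "age" and "host_age"
selected = toggle(frozenset(), "host_age")  # this set would drive Panel B
```

Note that a plain substring filter finds `age` and `host_age` but not `subject_years`, so some related keys would still need to be discovered by browsing.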

Panel B

Each row lists the number of samples in a single project that share a single key-value pair. The values in the "tag" column are filtered by the selections in Panel A. [Screenshot: panelb]

This is 401,650 rows in a 16.4 MB text file (1.7 MB gzipped), but that's tab-separated values with a lot of repetition; there may be a more efficient way to record this.
It would be useful if panel B could be downloaded as a csv, or just copied in a way that preserves the columns. I was initially thinking this list should be filterable too, and then a user could export some final list of projects selected, but if all the data's in this table anyway, I'm guessing it's easier to export the table rather than add a whole new panel that gets updated dynamically.
Relatedly, it would be helpful if this table could be sorted by all three of these columns, but if someone's going to be loading this in Excel/RStudio/Google Sheets anyway, maybe we don't need to replicate that functionality?
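For the export idea, here is a minimal sketch of filtering Panel B rows by the tags selected in Panel A and writing them out as CSV. It assumes four columns (tag, project, value, samples); the toy rows and the helper names `rows_for_tags`, `to_csv`, and `projects_in` are hypothetical.

```python
import csv
import io

# Made-up rows; assumed column order: tag, project, value, samples.
TOY_VALUE_COUNTS = [
    ("host_age", "PRJ_A", "34", 12),
    ("host_age", "PRJ_B", "61", 3),
    ("age", "PRJ_B", "61", 7),
    ("diet", "PRJ_C", "vegan, strict", 5),
]


def rows_for_tags(rows, selected_tags):
    """Keep only the rows whose tag was selected in Panel A."""
    return [row for row in rows if row[0] in selected_tags]


def to_csv(rows):
    """Serialize rows as CSV text; the csv module quotes values containing commas."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["tag", "project", "value", "samples"])
    writer.writerows(rows)
    return buffer.getvalue()


def projects_in(rows):
    """Distinct project IDs among the rows, i.e. the list a user would filter on."""
    return sorted({row[1] for row in rows})


age_rows = rows_for_tags(TOY_VALUE_COUNTS, {"host_age", "age"})
export_text = to_csv(age_rows)
project_list = projects_in(age_rows)
```

Exporting the filtered table and deriving the project list from it afterwards keeps the UI to two panels, which matches the "easier to export the table" instinct above.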

@vincerubinetti you've done way more of this than I have, does this sound like a reasonable way to give people some general info about the attributes? If the size of the Panel B data is an issue, it would be fun to try some minifying, or we could reduce the granularity of the data in some way.

vincerubinetti commented 1 month ago

What you're describing for panel A sounds like it could be a "tags" input, like this: https://mui.com/material-ui/react-autocomplete/#checkboxes The dropdown when searching could show sample and project count for each row (in order of sample count maybe), and searching would be done by the tag name, fuzzily. Unless you think it's important that the user can paginate through all of those rows and do a sort on a column of their choice, in which case yeah it can be a table with checkboxes on the left to select rows.
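To make "searching by tag name, fuzzily" concrete, here is one possible ranking, sketched in Python with the standard library's `difflib`. This is not how the MUI Autocomplete filters options; the scoring rule, the `cutoff` value, the toy counts, and the name `fuzzy_rank` are assumptions for illustration only.

```python
import difflib

# Made-up sample counts per tag.
TOY_TAGS = {
    "collection_date": 900,
    "age": 300,
    "host_age": 120,
    "subject_years": 15,
}


def fuzzy_rank(query, tag_samples, limit=10, cutoff=0.5):
    """Rank tags for a dropdown: substring hits first, then close spellings.

    Ties are broken by sample count (largest first), then alphabetically.
    """
    needle = query.strip().lower()
    scored = []
    for tag, samples in tag_samples.items():
        name = tag.lower()
        if needle in name:
            score = 1.0  # exact substring matches always rank highest
        else:
            score = difflib.SequenceMatcher(None, needle, name).ratio()
        if score >= cutoff:
            scored.append((-score, -samples, tag))
    scored.sort()
    return [tag for _, _, tag in scored[:limit]]


substring_hits = fuzzy_rank("age", TOY_TAGS)
typo_hits = fuzzy_rank("hostage", TOY_TAGS)  # tolerates the missing underscore
```

A dedicated fuzzy-search library in the browser would likely score differently; the point is only that substring hits can be ordered by sample count while near-misses still surface.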

For panel B, those features all sound doable.

The main concern for me there would be the over-the-wire size, but 2 MB isn't all that bad. I'll also do this "lazily", meaning the user's browser won't download that file until they actually start interacting with the "tag explorer".

If there are any easy ways to make that data more compact, let's definitely do it. Worst case it could even be in binary format with a custom encoding.
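One easy option short of a binary format would be to nest the rows so each tag and project string is stored once per group rather than once per row. A rough sketch, assuming rows of (tag, project, value, samples) and JSON as the container; the toy rows and the names `nest` and `flatten` are hypothetical:

```python
import json

# Made-up rows; assumed column order: tag, project, value, samples.
TOY_ROWS = [
    ("host_age", "PRJ_A", "34", 12),
    ("host_age", "PRJ_A", "35", 2),
    ("host_age", "PRJ_B", "61", 3),
    ("age", "PRJ_B", "61", 7),
]


def nest(rows):
    """Group flat rows as tag -> project -> value -> sample count."""
    nested = {}
    for tag, project, value, samples in rows:
        nested.setdefault(tag, {}).setdefault(project, {})[value] = samples
    return nested


def flatten(nested):
    """Inverse of nest(): rebuild the flat (tag, project, value, samples) rows."""
    return [
        (tag, project, value, samples)
        for tag, projects in nested.items()
        for project, values in projects.items()
        for value, samples in values.items()
    ]


nested = nest(TOY_ROWS)
flat_json = json.dumps(TOY_ROWS, separators=(",", ":"))
nested_json = json.dumps(nested, separators=(",", ":"))
print(len(flat_json), len(nested_json))  # raw sizes before any compression
```

Whether this ends up smaller than the flat TSV once both are gzipped would need to be measured on the real file; the toy sizes printed here say nothing about that.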

cgreene commented 1 month ago

On making data more compact: I'm guessing that what you get from the compression algorithm isn't that far from the lower bound of what you'd get (unless there's data that you can remove). Compression algorithms should be reasonably good at dealing with repeated text. One path you could consider, if there's a natural way to shard the data that aligns with how users will interact with it: you will lose some compression efficiency, but if users only need small pieces at a time, that could still work.
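The sharding trade-off can be checked empirically with a few lines of Python. This is a sketch under assumptions: rows are (tag, project, value, samples), shards are keyed on the first character of the tag, and the names `shard_key`, `shard_rows`, and `gzip_tsv` are made up for the example.

```python
import gzip
from collections import defaultdict

# Made-up rows; assumed column order: tag, project, value, samples.
TOY_ROWS = [
    ("age", "PRJ_B", "61", 7),
    ("diet", "PRJ_C", "vegan", 5),
    ("host_age", "PRJ_A", "34", 12),
    ("host_age", "PRJ_B", "61", 3),
]


def shard_key(tag):
    """Hypothetical scheme: first character of the tag, '_' for anything else."""
    first = tag[:1].lower()
    return first if first.isalnum() else "_"


def shard_rows(rows):
    """Group rows into shards by the first character of their tag."""
    shards = defaultdict(list)
    for row in rows:
        shards[shard_key(row[0])].append(row)
    return dict(shards)


def gzip_tsv(rows):
    """Serialize rows as tab-separated text and gzip the result."""
    text = "".join("\t".join(str(field) for field in row) + "\n" for row in rows)
    return gzip.compress(text.encode("utf-8"))


shards = shard_rows(TOY_ROWS)
whole_size = len(gzip_tsv(TOY_ROWS))
sharded_size = sum(len(gzip_tsv(shard)) for shard in shards.values())
print(f"one file: {whole_size} bytes; {len(shards)} shards: {sharded_size} bytes total")
```

On four toy rows the numbers are dominated by gzip's fixed per-file overhead, so this is only useful when pointed at the real Panel B file. The quantity to compare is the total sharded size against the single file, weighed against how many shards a typical user would actually fetch.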

rabdill commented 1 month ago

The tags input looks like a great option! I think sorting by sample count makes sense, especially if there's a utility that would take care of making the list already. As for the data format, it sounds like we could start with the regular compression, and follow up if we have time to get creative with encoding or sharding alphabetically? The files are attached here:

Panel A: key_counts.tsv.gz
Panel B: value_counts.tsv.gz

The queries for generating these are straightforward, but I'm going to forget, so for future reference when we may incorporate this into a release pipeline:

--- Panel A
SELECT t.tag, COUNT(t.srs) AS samples, COUNT(DISTINCT s.project) AS projects
FROM tags t
INNER JOIN samples s
    USING(srs)
GROUP BY 1
ORDER BY 2 DESC,3 DESC,1

--- Panel B
SELECT t.tag, s.project, t.value, COUNT(t.srs) AS samples
FROM tags t
INNER JOIN samples s
    USING(srs)
GROUP BY 1,2,3
ORDER BY 4 DESC

(At least for now, there are no samples with multiple values for a single tag.)
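Since `COUNT(t.srs)` counts rows rather than distinct samples, a sample with two values for the same tag would be counted twice in Panel A, so this assumption may be worth checking in the release pipeline. Below is a sketch using Python's built-in `sqlite3` on toy data. The column layout `tags(srs, tag, value)` and `samples(srs, project)` is inferred from the queries above, the real database engine may differ, and `multi_valued` is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE samples (srs TEXT PRIMARY KEY, project TEXT);
    CREATE TABLE tags (srs TEXT, tag TEXT, value TEXT);
    -- Made-up rows for illustration only.
    INSERT INTO samples VALUES ('SRS1', 'PRJ_A'), ('SRS2', 'PRJ_A');
    INSERT INTO tags VALUES
        ('SRS1', 'host_age', '34'),
        ('SRS2', 'host_age', '61');
    """
)

# Any (sample, tag) pair that carries more than one distinct value.
MULTI_VALUE_CHECK = """
SELECT srs, tag, COUNT(DISTINCT value) AS n_values
FROM tags
GROUP BY srs, tag
HAVING COUNT(DISTINCT value) > 1
"""


def multi_valued(connection):
    """Return the (srs, tag, n_values) rows that violate the one-value assumption."""
    return connection.execute(MULTI_VALUE_CHECK).fetchall()


violations_before = multi_valued(conn)

# Simulate a sample that gains a second, conflicting value for the same tag.
conn.execute("INSERT INTO tags VALUES ('SRS1', 'host_age', '35')")
violations_after = multi_valued(conn)
```

If the check ever returns rows, switching the Panel A count to `COUNT(DISTINCT t.srs)` would be one way to keep the sample totals honest.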

blekhmanlab / compendium_website

Tag explorer feature #31
