blekhmanlab / compendium_website

Website for the Human Microbiome Compendium
http://microbiomap.org/

Tag explorer feature #31

Open rabdill opened 1 month ago

rabdill commented 1 month ago

User story(ish): As a website visitor, I can search sample attributes to determine whether the compendium contains data relevant to my research question.

--

Based on an email exchange with a user looking for some host phenotype information, my general idea is for two panels, with the first filtering the rows in the second. None of this nomenclature is very precise, but I'll use "key" and "value" for the sample attributes we currently pull from NCBI.

Panel A

Each row lists a key and how broadly it's available. [image: panela]

Panel B

Each row lists the number of samples in a single project that share a single key-value pair. The values in the "tag" column are filtered by the selections in Panel A. [image: panelb]
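
To make the Panel A to Panel B relationship concrete, here is a rough sketch of the filtering step. The row shape mirrors the columns in the Panel B description, but `PanelBRow`, `filter_panel_b`, and the example rows are made up for illustration; they are not the site's actual code or data. It also assumes that an empty selection shows no rows, which is a design choice still open for discussion.

```python
from typing import Iterable, List, NamedTuple


class PanelBRow(NamedTuple):
    """One Panel B row: samples in a project sharing a key-value pair."""
    tag: str       # the "key", e.g. "host_sex"
    project: str
    value: str
    samples: int


def filter_panel_b(rows: Iterable[PanelBRow], selected_tags: Iterable[str]) -> List[PanelBRow]:
    """Keep only the rows whose tag is currently selected in Panel A."""
    selected = set(selected_tags)
    return [row for row in rows if row.tag in selected]


# Invented example rows, only to show the call.
example_rows = [
    PanelBRow("host_sex", "project_a", "female", 12),
    PanelBRow("host_age", "project_a", "34", 3),
    PanelBRow("host_sex", "project_b", "male", 7),
]
visible = filter_panel_b(example_rows, ["host_sex"])
```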

@vincerubinetti you've done way more of this than I have, does this sound like a reasonable way to give people some general info about the attributes? If the size of the Panel B data is an issue, it would be fun to try some minifying, or we could reduce the granularity of the data in some way.

vincerubinetti commented 1 month ago

What you're describing for panel A sounds like it could be a "tags" input, like this: https://mui.com/material-ui/react-autocomplete/#checkboxes The dropdown when searching could show sample and project count for each row (in order of sample count maybe), and searching would be done by the tag name, fuzzily. Unless you think it's important that the user can paginate through all of those rows and do a sort on a column of their choice, in which case yeah it can be a table with checkboxes on the left to select rows.
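
A minimal sketch of the dropdown logic described above: match tag names fuzzily, then order the hits by sample count. "Fuzzy" is interpreted here as a case-insensitive subsequence match, which is only one of several possible definitions and not necessarily what the MUI component does. The names `fuzzy_match` and `search_tags` and the example counts are hypothetical.

```python
from typing import List, Tuple

TagRow = Tuple[str, int, int]  # (tag, samples, projects), as in Panel A


def fuzzy_match(query: str, tag: str) -> bool:
    """True if the characters of query appear in tag in order (case-insensitive)."""
    remaining = iter(tag.lower())
    # `ch in remaining` advances the iterator, so order is enforced.
    return all(ch in remaining for ch in query.lower())


def search_tags(tags: List[TagRow], query: str, limit: int = 10) -> List[TagRow]:
    """Return matching tags, most samples first, capped at `limit` rows."""
    hits = [row for row in tags if fuzzy_match(query, row[0])]
    hits.sort(key=lambda row: row[1], reverse=True)
    return hits[:limit]


# Invented counts, only to show the call.
example_tags = [
    ("host_age", 500, 12),
    ("host_sex", 800, 20),
    ("geo_loc_name", 900, 30),
]
suggestions = search_tags(example_tags, "hst")
```

An empty query matches every tag, so the same function can populate the dropdown before the user types anything.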

For panel B, those features all sound doable.

The main concern for me there would be the over-the-wire size, but 2 MB isn't all that bad. I'll also do this "lazily", meaning the user's browser won't download that file until they actually start interacting with the "tag explorer".

If there are any easy ways to make that data more compact, let's definitely do it. Worst case it could even be in binary format with a custom encoding.

cgreene commented 1 month ago

On making the data more compact: I'm guessing that what you get from the compression algorithm isn't far from the lower bound of what you could get (unless there's data that you can remove). Compression algorithms should be reasonably good at dealing with repeated text. One path you could consider: if there's a natural way to shard the data that aligns with how users will interact with it, you will lose some compression efficiency, but if users need only small pieces, that could still work.
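
One way to check this trade-off before committing to a format is to compare the compressed size of the whole file against the compressed sizes of per-letter shards. The sketch below uses `zlib` from the Python standard library on made-up rows; the tag names, row layout, and helper names are assumptions, and the real numbers would have to come from the actual Panel B file.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def shard_by_first_letter(lines: List[str]) -> Dict[str, List[str]]:
    """Group non-empty lines by their first character (lowercased)."""
    shards: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        shards[line[0].lower()].append(line)
    return dict(shards)


# Made-up rows in a tag / project / value / samples layout.
lines = [
    f"{tag}\tproject_{p}\tvalue_{v}\t{p * v + 1}"
    for tag in ("age", "body_site", "collection_date", "host_sex")
    for p in range(50)
    for v in range(5)
]

whole = "\n".join(lines)
raw_bytes = len(whole.encode("utf-8"))
whole_bytes = compressed_size(whole)
shard_bytes = {
    letter: compressed_size("\n".join(rows))
    for letter, rows in shard_by_first_letter(lines).items()
}

print(f"raw: {raw_bytes} B, compressed whole: {whole_bytes} B")
print(f"compressed shards: {shard_bytes}, total: {sum(shard_bytes.values())} B")
```

Comparing `whole_bytes` with the shard total shows how much efficiency sharding costs, while the largest single shard shows what one lookup would actually download.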

rabdill commented 1 month ago

The tags input looks like a great option! I think sorting by sample count makes sense, especially if there's a utility that would take care of making the list already. As for the data format, it sounds like we could start with the regular compression, and follow up if we have time to get creative with encoding or sharding alphabetically? The files are attached here:

The queries for generating these are straightforward, but I'm going to forget, so for future reference when we may incorporate this into a release pipeline:

--- Panel A
SELECT t.tag, COUNT(t.srs) AS samples, COUNT(DISTINCT s.project) AS projects
FROM tags t
INNER JOIN samples s
    USING(srs)
GROUP BY 1
ORDER BY 2 DESC,3 DESC,1

--- Panel B
SELECT t.tag, s.project, t.value, COUNT(t.srs) AS samples
FROM tags t
INNER JOIN samples s
    USING(srs)
GROUP BY 1,2,3
ORDER BY 4 DESC

(At least for now, there are no samples with multiple values for a single tag.)
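
Since `COUNT(t.srs)` counts rows rather than distinct samples, a sample that had two values for the same tag would be counted twice, so that assumption is worth checking whenever the data is regenerated. A possible check, sketched here against an in-memory SQLite database: the table and column names come from the queries above, but the column types, the example rows, and the function name `multi_valued_tags` are assumptions, and the real database may need different syntax.

```python
import sqlite3

# Toy schema using only the columns referenced in the Panel A/B queries.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE samples (srs TEXT PRIMARY KEY, project TEXT);
    CREATE TABLE tags (srs TEXT, tag TEXT, value TEXT);
    INSERT INTO samples VALUES ('s1', 'p1'), ('s2', 'p1'), ('s3', 'p2');
    INSERT INTO tags VALUES
        ('s1', 'host_sex', 'female'),
        ('s2', 'host_sex', 'female'),
        ('s3', 'host_sex', 'male');
    """
)

# Any row returned here is a (sample, tag) pair with more than one value.
DUPLICATE_CHECK = """
SELECT srs, tag, COUNT(DISTINCT value) AS n_values
FROM tags
GROUP BY srs, tag
HAVING COUNT(DISTINCT value) > 1
"""


def multi_valued_tags(connection: sqlite3.Connection) -> list:
    """Return (srs, tag, n_values) for samples with multiple values per tag."""
    return connection.execute(DUPLICATE_CHECK).fetchall()


clean = multi_valued_tags(conn)  # nothing violates the assumption yet

# Add a conflicting value for sample s3 to show what a violation looks like.
conn.execute("INSERT INTO tags VALUES ('s3', 'host_sex', 'female')")
dirty = multi_valued_tags(conn)
```

If this ever returns rows, switching the sample counts to `COUNT(DISTINCT t.srs)` would keep the Panel A numbers from being inflated.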