chanzuckerberg / galago

Interpretation aids for genomic epidemiology
https://chanzuckerberg.github.io/galago/
MIT License

[Bug?] Galago doesn't like UShER JSON (yet) #204

Open AngieHinrichs opened 1 year ago

AngieHinrichs commented 1 year ago

Describe the bug

This may be a bug in the JSON produced by the UShER web interface, not Galago, but they're not working together yet, so let's figure it out.

Expected behavior / How to reproduce

This URL contains an Auspice V2 tree produced by an UShER web interface query:

https://genome.ucsc.edu/trash/ct/subtreeAuspice1_genome_16aa0_445360.json

[unfortunately that is a temporary file, note the "trash" in the name -- it will go away in a couple days, so I have saved a copy here: https://hgwdev.gi.ucsc.edu/~angie/XAY_XBA_XBC_2022-09-28.json ]

So I hoped this Galago Fetch URL would work:

https://galago.czgenepi.org/#/fetch/genome.ucsc.edu/trash/ct/subtreeAuspice1_genome_16aa0_445360.json

But I get an error: "Woops! Error fetching tree file. We weren't able to import your tree data. Please confirm your URL is correct and publicly accessible, or upload your JSON file directly below."

Interestingly, I do get farther with fetch if I use the backup copy on a different server:

https://galago.czgenepi.org/#/fetch/hgwdev.gi.ucsc.edu/~angie/XAY_XBA_XBC_2022-09-28.json

-- that gets me as far as the "Analyze your data in Galago" dialog, where I can choose the pathogen (SARS-CoV-2) -- but I can't choose a State/Province, probably because my JSON has only the country level. There is a drop-down for State/Province, but it has no values.

Would it be possible to use the country metadata instead if the state metadata is missing from the JSON?
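A minimal sketch of the fallback asked about above, assuming the Auspice v2 convention that each trait lives under `node_attrs` as `{"value": ...}`. The trait names `division` (state/province) and `country` follow the usual Nextstrain naming and are assumptions here, as is the helper name `best_location`; this is not Galago's actual code.

```python
from typing import Optional, Tuple

# Assumed trait names, most specific first.
LOCATION_KEYS = ("division", "country")

def best_location(node_attrs: dict, keys=LOCATION_KEYS) -> Optional[Tuple[str, str]]:
    """Return (trait_name, value) for the most specific trait present, else None."""
    for key in keys:
        trait = node_attrs.get(key)
        # Traits are expected to look like {"value": "..."}; skip anything else.
        value = trait.get("value") if isinstance(trait, dict) else None
        if value:
            return key, value
    return None

# A tip that only carries country-level metadata, as in the UShER JSON described above.
tip_attrs = {"country": {"value": "USA"}}
print(best_location(tip_attrs))  # ('country', 'USA')
```

A tip with no geographic trait at all returns `None`, so the caller can decide whether to hide the location selector or leave it empty.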

sidneymbell commented 1 year ago

Ah! I somehow didn't get a notification for this issue. Thanks so much for investigating, @AngieHinrichs !

I just pushed a PR to our staging server to make all geographical data optional, and I'm mostly able to load your file via https://galago-labs.czgenepi.org/#/fetch/https://hgwdev.gi.ucsc.edu/~angie/XAY_XBA_XBC_2022-09-28.json but it hiccups because it expects num_date rather than date.

This is easy to fix on my end. I'll get this up and running on prod by early next week at the latest and let you know as soon as it's ready. Thanks again! So excited :)
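For anyone generating linkouts, the fetch URLs in this thread all follow the same pattern: the Galago host, then `#/fetch/`, then the address of the tree JSON. A small sketch of that pattern, where the function name is hypothetical and the default host is simply the staging server mentioned above:

```python
def galago_fetch_url(json_url: str, base: str = "https://galago-labs.czgenepi.org") -> str:
    """Build a Galago fetch link by appending the tree JSON address to '#/fetch/'."""
    return f"{base.rstrip('/')}/#/fetch/{json_url}"

print(galago_fetch_url("https://hgwdev.gi.ucsc.edu/~angie/XAY_XBA_XBC_2022-09-28.json"))
```

The thread shows links both with and without the `https://` prefix on the JSON address, so the sketch passes the address through unchanged rather than guessing which form is required.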

AngieHinrichs commented 1 year ago

Great! Yeah, UShER JSON doesn't have all of the cool stuff that Augur JSON does, but I'm glad you can work with it anyway! Looking forward to adding a linkout. 😄

sidneymbell commented 1 year ago

@AngieHinrichs -- I haven't forgotten about this! Got unexpectedly slammed with a few other things this week. Next week is looking wide open, though, and this is top of my list. Thanks for your patience.

AngieHinrichs commented 1 year ago

No worries, same here! :) (except not sure about next week) No pressure from my side. It will be easy to add a linkout whenever.

sidneymbell commented 1 year ago

Hey @AngieHinrichs! At long last (apologies -- covid finally found us after 3 yrs), I've got a fix for this.

The issue was indeed parsing dates on our end. We now accept either date or num_date fields. I also made a couple tweaks to the visualizations to just leave out tips with no date field (there were just a handful in this test JSON with missing dates). Thanks again for flagging the incompatibility and providing the test data!

My current patch of leaving these samples out might not be a great solution for datasets with more than a few missing dates, though. Do most UShER samples come in with dates, or is it common to have a significant percentage of samples without?
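A rough sketch of the behavior described above: accept either `num_date` or `date`, and set aside tips with neither. It assumes both fields sit under `node_attrs` as `{"value": ...}`, that `date` is a `YYYY-MM-DD` string, and that `num_date` is a decimal year. The function names are hypothetical and this is not the actual Galago patch.

```python
from datetime import date, datetime
from typing import List, Optional, Tuple

def decimal_year(iso_date: str) -> Optional[float]:
    """Convert 'YYYY-MM-DD' to a decimal year; return None if it does not parse."""
    try:
        d = datetime.strptime(iso_date, "%Y-%m-%d").date()
    except ValueError:
        return None
    start = date(d.year, 1, 1)
    year_length = (date(d.year + 1, 1, 1) - start).days  # 365 or 366
    return d.year + (d - start).days / year_length

def node_num_date(node_attrs: dict) -> Optional[float]:
    """Prefer num_date, fall back to date, return None when neither is usable."""
    num = node_attrs.get("num_date")
    if isinstance(num, dict) and isinstance(num.get("value"), (int, float)):
        return float(num["value"])
    raw = node_attrs.get("date")
    if isinstance(raw, dict) and isinstance(raw.get("value"), str):
        return decimal_year(raw["value"])
    return None

def split_by_date(tips: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Separate tips with a usable date from tips without one."""
    dated, undated = [], []
    for tip in tips:
        has_date = node_num_date(tip.get("node_attrs", {})) is not None
        (dated if has_date else undated).append(tip)
    return dated, undated
```

Comparing `len(undated)` to the total number of tips gives a quick measure of how much of a tree would be left out by this approach.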

AngieHinrichs commented 1 year ago

Hi Sidney! So sorry to hear about the covid, but good job avoiding it for so long. Glad you're back in dev-land.

The "UShER samples" are a mix of sequences from INSDC (GenBank, ENA, DDBJ) and/or GISAID (many sequences are in both and I attempt to de-duplicate). Most of them have dates, but not all, and some (by law in some locations) are year-month-only unfortunately. If it turns out to be a big problem then there are several things we could try, such as suggesting that people choose a larger subtree size in UShER to send onward to you so there's more margin for having to discard some samples.

Is there an optimal range of sizes for Galago input trees? Does it depend on the number of the user's samples of interest? I imagine some users might upload a handful of sequences from an outbreak that probably fall into one or two subtrees, while others might have hundreds of sequences from a week's worth of runs in their lab (potentially resulting in many subtrees). The UShER web interface's default subtree size is 50 which is OK for finding the few most closely related sequences, but for other purposes like evaluating a possible new lineage for pangolin, 1000 is a better size. The max is 5000.
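One possible way to handle the year-month-only dates mentioned above, instead of discarding them, is to turn each partial date into the earliest and latest day it could represent. This is only a sketch: the function name is hypothetical, and how (or whether) Galago would use such a range is not something this thread settles.

```python
import calendar
from datetime import date
from typing import Optional, Tuple

def date_bounds(raw: str) -> Optional[Tuple[date, date]]:
    """Return (earliest, latest) day consistent with 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD'."""
    try:
        nums = [int(part) for part in raw.split("-")]
        if len(nums) == 1:
            return date(nums[0], 1, 1), date(nums[0], 12, 31)
        if len(nums) == 2:
            # monthrange returns (weekday of first day, number of days in month)
            last_day = calendar.monthrange(nums[0], nums[1])[1]
            return date(nums[0], nums[1], 1), date(nums[0], nums[1], last_day)
        if len(nums) == 3:
            exact = date(nums[0], nums[1], nums[2])
            return exact, exact
    except ValueError:
        # Non-numeric parts or out-of-range values fall through to None.
        pass
    return None

print(date_bounds("2022-09"))  # (datetime.date(2022, 9, 1), datetime.date(2022, 9, 30))
```

Masked forms such as `2022-XX-XX` return `None` here, so they would still need to be dropped or handled separately.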

sidneymbell commented 1 year ago

Glad to be back! Although I've got a lot of foggy brain still, so lmk if any of this doesn't make sense :)

We can accommodate any of those tree sizes, although performance is best at <3000-3500ish. We also have some UI tools to help users sift through a given tree to find clades with their samples of interest. One thing to note is that (at least for now) Galago only ingests one tree at a time.

In an ideal world, I'd recommend something along the lines of:

sidneymbell commented 1 year ago

@AngieHinrichs -- another idea we could think about at some point -- Galago helps the user find which clade(s) to generate a report for based on their samples of interest. It could be useful to pass through the names of their input samples via query param, although this could very quickly get too long and cumbersome to be functional. Would need to noodle on this a bit more.

AngieHinrichs commented 1 year ago

Great about the sample size flexibility.

> It could be useful to pass through the names of their input samples via query param, although this could very quickly get too long and cumbersome to be functional.

Yeah. Maybe in a text file alongside the JSON file that has the tree? One name per line? Or -- actually they can be extracted from the JSON itself, filter nodes for userOrOld == "uploaded sample" if there's already a convenient way to do that.
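The filter suggested above could look roughly like this. The sketch assumes the Auspice v2 layout (a top-level `tree` object with nested `children`) and that `userOrOld` is stored under `node_attrs` as `{"value": ...}`; the function names are hypothetical.

```python
from typing import Iterator, List

def iter_nodes(root: dict) -> Iterator[dict]:
    """Yield every node in the tree, depth first, using an explicit stack."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        # Reverse so children are visited in their original order.
        stack.extend(reversed(node.get("children", [])))

def uploaded_sample_names(auspice_json: dict) -> List[str]:
    """Collect names of nodes whose userOrOld trait equals 'uploaded sample'."""
    names = []
    for node in iter_nodes(auspice_json["tree"]):
        trait = node.get("node_attrs", {}).get("userOrOld", {})
        if trait.get("value") == "uploaded sample":
            names.append(node["name"])
    return names

# Tiny made-up tree for illustration only.
example = {"tree": {"name": "root", "children": [
    {"name": "sampleA", "node_attrs": {"userOrOld": {"value": "uploaded sample"}}},
    {"name": "existing_1", "node_attrs": {"country": {"value": "USA"}}},
]}}
print(uploaded_sample_names(example))  # ['sampleA']
```

An explicit stack is used instead of recursion so that deep trees do not hit Python's recursion limit.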

sidneymbell commented 1 year ago

Oooh good call. Yeah as long as there's a metadata trait in the JSON I can parse them on ingest.


AngieHinrichs commented 10 months ago

Hi @sidneymbell -- sorry I let this all get buried in my inbox for, yikes! almost a year! 🤯 But I would still like to link out to Galago. This is the linkout format that I have:

https://galago-labs.czgenepi.org/#/fetch/https://hgwdev.gi.ucsc.edu/~angie/test_UShER_MicrobeTrace.json

but when I try that I get an error message:

[image: screenshot of the error message]

The JavaScript console says:

index.a7b9082c.js:277 XHR failed loading: GET "https://hgwdev.gi.ucsc.edu/~angie/test_UShER_MicrobeTrace.json".

I can view https://hgwdev.gi.ucsc.edu/~angie/test_UShER_MicrobeTrace.json in my web browser and see its response headers with curl:

curl -SsI https://hgwdev.gi.ucsc.edu/~angie/test_UShER_MicrobeTrace.json

HTTP/1.1 200 OK
Date: Tue, 12 Sep 2023 16:19:15 GMT
Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips PHP/5.4.16 mod_wsgi/3.4 Python/2.7.5
Last-Modified: Thu, 07 Sep 2023 18:16:07 GMT
ETag: "943b-604c8da97400c"
Accept-Ranges: bytes
Content-Length: 37947
Vary: Accept-Encoding
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: Range
Content-Type: application/json

If you don't have time to work on this, no problem! Just wanted to let you know I'm still interested if you do have time. 🙂
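For reference, a small sketch that reads `curl -I` output like the response above and reports three things a browser-side fetch typically depends on: the status code, a JSON content type, and a permissive `Access-Control-Allow-Origin`. The function names are hypothetical. Passing these checks does not by itself explain or rule out the failure reported here.

```python
from typing import Dict, Tuple

def parse_head_response(raw: str) -> Tuple[int, Dict[str, str]]:
    """Split raw header text into a status code and a dict of lower-cased header names."""
    lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
    status_code = int(lines[0].split()[1])  # "HTTP/1.1 200 OK" -> 200
    headers = {}
    for ln in lines[1:]:
        name, _, value = ln.partition(":")  # split on the first colon only
        headers[name.strip().lower()] = value.strip()
    return status_code, headers

def fetch_precheck(raw: str) -> Dict[str, bool]:
    """Report basic conditions for fetching the file from another origin."""
    status, headers = parse_head_response(raw)
    return {
        "status_ok": status == 200,
        "is_json": headers.get("content-type", "").startswith("application/json"),
        "cors_any_origin": headers.get("access-control-allow-origin") == "*",
    }

# A subset of the headers quoted in this thread.
sample = """HTTP/1.1 200 OK
Date: Tue, 12 Sep 2023 16:19:15 GMT
Content-Length: 37947
Access-Control-Allow-Origin: *
Content-Type: application/json
"""
print(fetch_precheck(sample))
```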