PoonLab / gromstole

Quantifying SARS-CoV-2 VoCs from NGS data of wastewater samples
MIT License

estimate-freqs.R fails on example data #85

Closed ArtPoon closed 1 year ago

ArtPoon commented 1 year ago

I'm trying to run scripts/estimate-freqs.R with the example data from the readme. It looks like in issue 71, the sites values in the collection json files were changed, but I don't see nt or aa values in any of the current jsons here at constellations/constellations/definitions or in cov-lineages' versions. Is there a working version of that script from before it was edited to use constellation$sites$nt and constellation$sites$aa in this commit? Or am I looking at something wrong? Thanks!

Originally posted by @skunklem in https://github.com/PoonLab/gromstole/issues/76#issuecomment-1648816796

SandeepThokala commented 1 year ago

The estimate-freqs.R script worked fine with the example data from the README.

GopiGugan commented 1 year ago

The user was running into an issue because they were likely using the original constellation files from the cov-lineages repo or a file that was not modified in our forked repo.

We forked cov-lineages/constellations and modified sites in select constellation files to display the nucleotide substitution associated with the AA substitution:

https://github.com/PoonLab/constellations/blob/c8ce4b757084333fdaef527ec707aaf84361c70c/constellations/definitions/cBA.5.json#L20-L29
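For anyone hitting the same failure, a quick way to tell which kind of constellation file you have is to inspect its `sites` field before running the script. The sketch below is not part of gromstole. It assumes that upstream cov-lineages files store `sites` as a flat list of mutation strings, and that the forked files expose `nt` and `aa` entries either as members of a single object or as keys on each record in a list. The exact forked layout is an assumption inferred from the script's use of `constellation$sites$nt` and `constellation$sites$aa`; the function name `sites_format` and the sample values are hypothetical.

```python
import json


def sites_format(constellation: dict) -> str:
    """Classify the layout of a constellation's "sites" field.

    Returns "nt_aa" if nt/aa entries are present, "flat" for a plain
    list of mutation strings, and "unknown" for anything else.
    """
    sites = constellation.get("sites")

    # Assumed layout A: a single object with "nt" and "aa" members.
    if isinstance(sites, dict):
        return "nt_aa" if {"nt", "aa"} <= sites.keys() else "unknown"

    if isinstance(sites, list) and sites:
        # Assumed layout B: a list of records, each carrying "nt" and "aa".
        if all(isinstance(s, dict) and {"nt", "aa"} <= s.keys() for s in sites):
            return "nt_aa"
        # Unmodified layout: a flat list of mutation strings.
        if all(isinstance(s, str) for s in sites):
            return "flat"

    return "unknown"


# Illustrative, made-up inputs (not copied from any real constellation file).
upstream = json.loads('{"sites": ["s:L452R", "s:F486V"]}')
forked = json.loads('{"sites": [{"aa": "s:L452R", "nt": "nt-placeholder"}]}')

print(sites_format(upstream))
print(sites_format(forked))
```

A file reported as `flat` has not been converted and, per the explanation above, would not provide the `nt` and `aa` values that estimate-freqs.R looks up.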

The following files were modified:

skunklem commented 1 year ago

That seems to be the case. Just following the links from your main gromstole repo to constellations, I end up viewing different files like this one below (not edited to have the "aa" and "nt" distinctions) rather than the ones linked above.

https://github.com/PoonLab/constellations/blob/47418a5605501552e0793fe02e5a3fffd010dc2c/constellations/definitions/cBA.5.json

Do you happen to have a script that can convert the "sites" details to this other format? That would be helpful for any of cov-lineages' constellations jsons that y'all haven't modified yet.

skunklem commented 1 year ago

Really, what I'm hoping to do (the reason I asked the above question) is to have a way of using gromstole with the most up-to-date constellations on wastewater samples that span from early strains to current ones. I'll be comparing its lineage predictions with other deconvolution tools, so I'd rather not be limited to just the few lineages with constellations files that y'all have manually prepared. If that's not really feasible, please let me know, as it will mean gromstole isn't a tool I should be considering.

ArtPoon commented 1 year ago

Hi @skunklem - we stopped updating constellations a while ago. The original purpose of Gromstole was to rapidly extract mutation frequencies from wastewater NGS data. It still does a pretty decent job of doing this. However, we were then asked to provide variant frequency estimates to distinguish between Delta and Omicron in wastewater. The binomial regression method did a reasonable job of this. Now that there are hundreds of variants that are only slightly different from each other, however, gromstole is no longer an appropriate tool for calling variant frequencies and I would direct you to one of the several deconvolution methods that have since been released.

skunklem commented 1 year ago

That makes sense. Thanks for the insights.

ArtPoon commented 1 year ago

Unfortunately the constellation files are no longer being used in our ww processing, and converting the constellations is not readily automated because we had been manually selecting a subset of mutations to "uniquely define" a given variant. (The latter was not sustainable, which is why we switched to a deconvolution method, i.e., Freyja). Closing as a wontfix issue.