broadinstitute / pyro-cov

Pyro models of SARS-CoV-2 variants
Apache License 2.0

alternative data sources/sharing replication dataset? #30

Open cwhittaker1000 opened 2 years ago

cwhittaker1000 commented 2 years ago

Hi there! Congrats on this work, it's amazing! We're currently doing some work looking at epistatic effects, and are hoping to build on the incredible work you folks have done with PyR0.

So far, we haven't been able to get access to a data feed from GISAID - I saw some others had had similar issues and inquired about alternative data sources (#13) for running the model. Can you advise on what data we'd need to go down this route (and where to get it from) - any potential advice you could provide on how to modify the code would also be hugely appreciated!

If the above isn't viable, would it be possible for you to share the processed dataset used for the analyses in your Science paper so we can make progress on extending the code while we continue to work out access issues?

Thanks in advance and congrats again on some awesome work!

Xiang-Leo commented 2 years ago

Hi, I'm also curious about the GISAID data feed. I got no response even after emailing GISAID. According to GISAID's terms, the authors cannot share the processed dataset with you. But there are alternative ways to get data. One is to download a dataset from an open source, such as the one Nextstrain provides. If you have a GISAID account, you can download all sequences and the corresponding metadata, then preprocess them with preprocess_gisaid.py.

Jialu-Zuo commented 3 weeks ago

@Xiang-Leo Dear Xiang-Leo, I've recently been trying to replicate this work, and the GISAID data is not available. Do you mean that we can download the data from https://nextstrain.org/ncov/open/global/all-time, change the name of the metadata, and preprocess the data with the preprocess_gisaid.py file? I tried preprocessing the data with the preprocess_usher.py file and found that the number of regions only reaches about 300, far fewer than the 1560 in the article. I'm wondering whether the situation is the same when preprocessing the Nextstrain data with preprocess_gisaid.py.
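
One rough way to check how many regions a metadata file could support, before running any preprocessing, is to count the distinct location values it contains. The sketch below is not part of pyro-cov and does not reproduce its region definition. The `count_regions` helper is hypothetical, and the `country` and `division` column names are assumptions about the Nextstrain metadata TSV that should be checked against the actual file header.

```python
import csv
import io


def count_regions(tsv_file, columns=("country", "division")):
    """Count distinct non-empty combinations of the given columns.

    `columns` is an assumption about the metadata layout; adjust it to
    match the header of the file you downloaded.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    regions = set()
    for row in reader:
        key = tuple((row.get(c) or "").strip() for c in columns)
        # Skip rows where any of the location fields is missing.
        if all(key):
            regions.add(key)
    return len(regions)


# Tiny in-memory example standing in for a real metadata file.
sample = io.StringIO(
    "strain\tcountry\tdivision\n"
    "a\tUSA\tTexas\n"
    "b\tUSA\tTexas\n"
    "c\tUSA\tOhio\n"
    "d\tChina\t\n"
)
print(count_regions(sample))
```

For a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the file object in. If the count is already far below 1560 at this stage, the gap likely comes from the data source rather than from the preprocessing script.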