EtthelWindels / tb_hiv

Code associated with "HIV co-infection is associated with reduced Mycobacterium tuberculosis transmissibility in sub-Saharan Africa"

Input metadata and sequence data #1

Open corneliusroemer opened 4 days ago

corneliusroemer commented 4 days ago

I saw the study in the SIB newsletter and wanted to try to replicate the analysis (at least the part without sequence data), but I am struggling to find the input metadata and sequence/SNP data. Much of the study's value seems to lie in the creation of the dataset, so it would be great if it were available. Did I miss it, or is it not (yet) available? Would you be able to publish the curated input dataset here as well?

EtthelWindels commented 4 days ago

Most of the sequence data were published previously; the project accession numbers are provided in the README file and in the manuscript. The metadata are provided as Supplementary Table S5 of the manuscript, and I have now also added them here (in the /data folder).

corneliusroemer commented 3 days ago

Amazing, thanks @EtthelWindels, that was fast! I totally missed the fact that Supplementary Table S5 exists; I somehow only saw the figures. Great to have it in the repo as well, since that makes it a lot easier to discover.

Regarding the sequences, it takes quite a bit of clicking around to find the contigs for each run. Did you process the runs into the final FASTA that you loaded into BEAST, or did you use the unassembled contigs? The BEAST XML specifies files like this: https://github.com/EtthelWindels/tb_hiv/blob/0c4cb3acf7fa9fe083d1d9664f5c1f69a040466a/analyses/1-main/Ma_main.xml#L21 so this might be processed data? To facilitate reproduction, it would be great, I think, if you could make those FASTA files available as well. If compressed with something like zstd, I don't think they would be all that large; with a bit of luck they might even fit within GitHub's 100 MB file limit.

Another reason why the FASTA would be great to have in the repo is that it probably contains the result of the downsampling to 400 sequences. IIUC, the metadata doesn't say which sequences you ended up using after downsampling.
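
To illustrate why the FASTA would help here: if its headers carry the same sample IDs as the metadata table, the downsampled subset could be recovered with a simple set comparison. The following is only a minimal sketch under that assumption. The column name `sample_id`, the tab delimiter and the toy records are hypothetical and not taken from the repo or the manuscript.

```python
import csv
import io


def fasta_ids(handle):
    """Return the set of record IDs (first token after '>') in a FASTA stream."""
    return {
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and len(line.strip()) > 1
    }


def metadata_ids(handle, id_column):
    """Return the set of IDs found in one column of a tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_column] for row in reader}


def split_by_usage(fasta_handle, metadata_handle, id_column="sample_id"):
    """Compare metadata IDs with FASTA IDs.

    'used'     : listed in the metadata and present in the FASTA
    'dropped'  : listed in the metadata but absent from the FASTA
    'unlisted' : present in the FASTA but missing from the metadata
    """
    used = fasta_ids(fasta_handle)
    listed = metadata_ids(metadata_handle, id_column)
    return {
        "used": listed & used,
        "dropped": listed - used,
        "unlisted": used - listed,
    }


if __name__ == "__main__":
    # Toy stand-ins for the real files (hypothetical IDs and columns).
    toy_fasta = io.StringIO(">A some description\nACGT\n>C\nGGCC\n")
    toy_metadata = io.StringIO("sample_id\thiv_status\nA\tpositive\nB\tnegative\n")
    for label, ids in split_by_usage(toy_fasta, toy_metadata).items():
        print(label, sorted(ids))
```

With real files, the two `io.StringIO` objects would be replaced by opened file handles. If the FASTA headers use a different identifier than the metadata (for example run accessions instead of sample names), an extra mapping step would be needed first.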