DiltheyLab / MetaMaps

Long-read metagenomic analysis
Other
98 stars 23 forks source link

How to download others branches using downloadRefSeq.pl #5

Closed RxLoutre closed 6 years ago

RxLoutre commented 6 years ago

Hello there,

I would like to know if there is anyway to download data from other species than this list : "archae bacteria fungi protozoa viral" ?

For specific purpose, I would like to construct a wider database, with less species from each branche (like 1 or 2 fungi, 1 or 2 archae etc) including vertebrate species. I see in the perl code that it was somewhat planned to do so.

Line 90 :

#my @target_subdirs = qw/bacteria fungi protozoa viral invertebrate plant unknown vertebrate_other vertebrate_mammalian/;

I'm really tempted to uncomment it, but I guess it is commented in purpose. Any idea of how I could achieve that ? Is it something you are working on ?

Cheers,

Roxane

AlexanderDilthey commented 6 years ago

Hi Roxane,

Yes, more comfortable ways to construct new reference databases is something I am working on.

First, if you want to download more data, you can specify --targetBranches - e.g. --targetBranches fungi,protozoa

What currently happens is that the subsequent steps will iterate through your downloaded directory; search for gzipped assembly files; extract (i.e. un-gzip) some of these, based on some criteria (e.g. based on whether the assembly is complete); and then, in a final step, find all unzipped assembly files, combine these into a single file, and introduce appropriate additional taxon IDs.

I guess this process might not be ideal for you - you probably want a very comprehensive download, and then somehow specify which files from the download you want to use? What would be a good way for you to specify which assembly files to include?

Best wishes,

Alex

RxLoutre commented 6 years ago

Hi Alex,

Alright, I'm not sure I got all of it. Something that could be easily configurable would be parsing a file listing all the genome you want to be included in. I can't help it but thinking about something that look like FastQScreen conf file :

#DATABASE       Human   /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Homo_sapiens/genome/Hg38/BWA/hg38_normalized.fa
#DATABASE       Mouse   /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Mus_musculus/genome/bwa/Mus_musculus.GRCm38.75.dna.toplevel.fa
#DATABASE       ZebraFish       /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Danio_rerio/genome/bwa/Danio_Rerio_ref_Zv9.fa
#DATABASE       AThaliana       /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Arabidopsis_thaliana/genome/TAIR10/bwa/TAIR10.fasta
#DATABASE       Lizard  /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Anas_platyrhynchos/Genome/BWA/GCF_000355885.1_BGI_duck_1.0_genomic.fasta
#DATABASE       Drosophila      /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Drosophilia_melanogaster/genome/bwa/Drosophila_melanogaster.BDGP5.69.dna.toplevel.fa
#DATABASE       Lactobacillus /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Lactococcus_lactis/subsp_lactis_Il1403/genome/bwa/NC_002662.fasta
DATABASE        Xanthomonas     /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Xanthomonas_campestris/genome/bwa/NC007086.1.fasta

I don't know if it can fit the rest of the pipeline, but this is the idea that came up in my mind when you asked !

AlexanderDilthey commented 6 years ago

Something like this might well work - the only additional thing we'd need is a 'taxon ID' column. This would be feasible, wouldn't it?


Von: Roxane Boyer notifications@github.com Gesendet: Montag, 13. August 2018 12:02 An: DiltheyLab/MetaMaps Cc: Alexander Dilthey; Comment Betreff: Re: [DiltheyLab/MetaMaps] How to download others branches using downloadRefSeq.pl (#5)

Hi Alex,

Alright, I'm not sure I got all of it. Something that could be easily configurable would be parsing a file listing all the genome you want to be included in. I can't help it but thinking about something that look like FastQScreen conf file :

DATABASE Human /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Homo_sapiens/genome/Hg38/BWA/hg38_normalized.fa

DATABASE Mouse /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Mus_musculus/genome/bwa/Mus_musculus.GRCm38.75.dna.toplevel.fa

DATABASE ZebraFish /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Danio_rerio/genome/bwa/Danio_Rerio_ref_Zv9.fa

DATABASE AThaliana /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Arabidopsis_thaliana/genome/TAIR10/bwa/TAIR10.fasta

DATABASE Lizard /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Anas_platyrhynchos/Genome/BWA/GCF_000355885.1_BGI_duck_1.0_genomic.fasta

DATABASE Drosophila /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Drosophilia_melanogaster/genome/bwa/Drosophila_melanogaster.BDGP5.69.dna.toplevel.fa

DATABASE Lactobacillus /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Lactococcus_lactis/subsp_lactis_Il1403/genome/bwa/NC_002662.fasta

DATABASE Xanthomonas /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Xanthomonas_campestris/genome/bwa/NC007086.1.fasta

I don't know if it can fit the rest of the pipeline, but this is the idea that came up in my mind when you asked !

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/DiltheyLab/MetaMaps/issues/5#issuecomment-412468813, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AGFsaWbOBQeoJrHAk4XSi1T_dPi40l3Lks5uQU6lgaJpZM4V4RB9.

RxLoutre commented 6 years ago

Well, I'm hyped about it, I think it could work :)

AlexanderDilthey commented 6 years ago

OK! I'll implement this and get back to you! Give me a few days :-)

AlexanderDilthey commented 6 years ago

OK - I've got something, let me know what you think and whether this is useful to you!

I've added a new command combineAndAnnotateReferences.pl - basically the idea is that you specify input genomes and taxon IDs in a file, and this script will prepare everything for you to pass on to buildDB.pl.

Here is an example command line:

perl combineAndAnnotateReferences.pl --inputFileList tests/fileList --outputFile tests/combinedHumanMouse.fa --taxonomyInDirectory downloads/taxonomy --taxonomyOutDirectory tests/combinedHumanMouse_NCBI_taxonomy_modified

The file fileList could look like this (one line per reference genome - i.e. here we've got two complete human genomes and one mouse genome):

9606 tests/shortHuman1.fa
9606 tests/shortHuman2.fa
10090 tests/shortMouse.fa

... and the database construction process could be completed with:

perl buildDB.pl --DB databases/testHumanMouse --FASTAs tests/combinedHumanMouse.fa --taxonomy tests/combinedHumanMouse_NCBI_taxonomy_modified

RxLoutre commented 6 years ago

Hi Alexander,

Thanks for working hard and fast on that case !

So, if I understand well, the new script combineAndAnnotateReferences.pl is a patch so I can use only a file listing all the path to my genomes.fasta as the only input, right ? Which kinda replace the first step which was to download the database from RefSeq.

I'm still uncertain about the meaning of the taxon ID in here.. If I have a bunch of genomes, is the taxon ID just a number I can made up in order to identify my genome or is it related to refSeq somehow and I have to fetch a proper TaxonID ? I mean, this is probably an important information you use for assigning a read right ? But I have no idea on how to attribute such a ID to my genomes...

I'm also not sure to understand what is needed with --taxonomyInDirectory, which is an input to the new script. How can I have this information ? If I didn't downloaded the genome from RefSeq ? Because so far, I don't even have a taxonomy directory. I saw one on an example run, it contains 4 .dmp files. delnodes, merged, names and nodes. And to be honest I have no idea of what informations they contains and what it means.

Sorry for all thoses question, I'm still very new to the field !

Thanks for your time,

Rox

AlexanderDilthey commented 6 years ago

Hi Roxane,

These are excellent questions!

First and most importantly, you're right in that this script replaces the RefSeq download.

Now, for the taxonomy: MetaMaps needs this information to understand how the input genomes are related to each other. It is contained in the 4 files you mentioned. The format of these files is identical to the format of the NCBI taxonomy database. By far the easiest method is therefore to stay within the NCBI taxonomy - almost any known species should have a unique ID in the NCBI system (https://www.ncbi.nlm.nih.gov/taxonomy), and typically this ID is part of the assembly metadata or can be obtained by querying the database for the name of your species. You can get a complete download of the most recent version of the NCBI taxonomy from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz, and the directory contained therein can be passed as input to --taxonomyInDirectory. It is instructive to open the files contained in there in a text editor, to get an idea of what they look like and what information they contain.

So, the easiest way forward is to identify the NCBI taxonomy ID for each of your input genomes and pass this information to MetaMaps.

In theory you could also make up your own taxonomy, store it in NCBI format and use it with MetaMaps, but unless you know exactly what you're doing I would heavily advise against that.

Please let me know if you have further questions!

RxLoutre commented 6 years ago

Hi Alexander,

Thanks a lot for your answers ! I'm kind of a newbie in the taxonomy field. I downloaded ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz and untared the file in a taxonomy directory within metamaps. Even if it contains more informations than it should (I mean by that more genomes than I'm going to use), is it still okay to give this as a --taxonomyInDirectory ?

The next step is to create the input file with my genomes and their corresponding NCBI taxonomy ID, and then finally try to build the database.

I have one more question : I'm working with a cluster, which means I do not install tools myself and I do not have the right to update MetaMaps installation. I knew that I'll probably have to update the tool at least once, as MetaMaps is very recent. But I'd like to make it easy for the people working at the datacenter and cluster facility. How can they update MetaMaps easily ? Is it something than can be done with git for example ?

Thanks again for your feedback !

Cheers,

Roxane

RxLoutre commented 6 years ago

Oh and one other simple question : The fileList use as a seperator a tabulation or a simple space ? ANd again an other one : can we comment this file with line starting with a # or not ?

AlexanderDilthey commented 6 years ago

Hi Roxane,

for the file list, any whitespace character (space or tab) is fine. Currently no comments are accepted. This could be easily changed, though.

Even if it contains more informations than it should (I mean by that more genomes than I'm going to use), is it still okay to give this as a --taxonomyInDirectory ?

Absolutely yes! You can always use the complete NCBI taxonomy!

Regarding the cluster question: you're right that the source code is still evolving rapidly and that future changes are likely. Currently the best way to update MetaMaps is to do a git pull followed be a rebuild (make clean; make metamaps). At some point in the future we'll switch to a more formal release schedule.

Best wishes,

Alex

RxLoutre commented 6 years ago

Great, thanks a lot for all your answers, have a great day ! And good luck with MetaMaps : )

Rox

AlexanderDilthey commented 6 years ago

Thank you! Please let me know if you run into problems or if you have additional feature requests! :-)

Alex

ms-gx commented 4 years ago

To generate my own DB: Do I ideally provide species_taxid or taxid (as for example given in assembly_summary_genbank.txt). I guess taxid, because MetaMaps goes down to strain level.