DiltheyLab / MetaMaps

Long-read metagenomic analysis
Other
99 stars 23 forks source link

Custom Database creation for one species #75

Open wangqi0000 opened 1 year ago

wangqi0000 commented 1 year ago

Hello,

I am having two issues and need some help in resolving the issue.

(1) I have some strain genome sequences of single species (such as Bacteroides_fragilis.fa), is it possible to construct a database for these sequences? If so, how should it be constructed? (2) When I run metamaps on my dataset, I obtain a file named "classification_results_B1.EM.WIMP" which showed below: image Can I know exactly which genome sequence x598 (highlighted in red) represents in NCBI?

Thanks!!

Best, Qi

AlexanderDilthey commented 1 year ago

Hi @wangqi0000,

Let me answer your questions in reverse order, because they are actually related :-)

(2) The x* taxon IDs are strain-specific "pseudo-taxon" IDs - i.e. if you have multiple reference genomes for the same species, these will be represented using different x*taxon IDs in the database.

You have two options to link these back to standard NCBI identifiers.

First, you can parse the database taxonomy (e.g. in /databases/miniSeq+H/taxonomy - the database taxonomy contains the x* entries and from these you can go back (i.e., one level up) to standard NCBI taxon IDs.

Second, if you want the specific NCBI genome sequence corresponding to an x* entry, you can query the DB.fa file, like to: grep 'x1088' databases/miniSeq+H/DB.fa (querying for x1088 in the default miniSeq+H database - the returned lines will contain the NCBI identifiers of the corresponding sequences).

(1) Yes, this is supported - see Step 2 here: https://github.com/DiltheyLab/MetaMaps#databases.

However, one limitation of the existing workflow is that it currently only supported RefSeq-style input directories.

If this does not work for you and if you could give me a few more details about your specific use case (input formats etc.), we could either extend the existing workflow or set up a new preprocessing script that does what you need!

wangqi0000 commented 1 year ago

Hi @wangqi0000,

Let me answer your questions in reverse order, because they are actually related :-)

(2) The x* taxon IDs are strain-specific "pseudo-taxon" IDs - i.e. if you have multiple reference genomes for the same species, these will be represented using different x*taxon IDs in the database.

You have two options to link these back to standard NCBI identifiers.

First, you can parse the database taxonomy (e.g. in /databases/miniSeq+H/taxonomy - the database taxonomy contains the x* entries and from these you can go back (i.e., one level up) to standard NCBI taxon IDs.

Second, if you want the specific NCBI genome sequence corresponding to an x* entry, you can query the DB.fa file, like to: grep 'x1088' databases/miniSeq+H/DB.fa (querying for x1088 in the default miniSeq+H database - the returned lines will contain the NCBI identifiers of the corresponding sequences).

(1) Yes, this is supported - see Step 2 here: https://github.com/DiltheyLab/MetaMaps#databases.

However, one limitation of the existing workflow is that it currently only supported RefSeq-style input directories.

If this does not work for you and if you could give me a few more details about your specific use case (input formats etc.), we could either extend the existing workflow or set up a new preprocessing script that does what you need!

Thank you so much! This will help me a lot.