Open wangqi0000 opened 1 year ago
Hi @wangqi0000,
Let me answer your questions in reverse order, because they are actually related :-)
(2) The x*
taxon IDs are strain-specific "pseudo-taxon" IDs - i.e. if you have multiple reference genomes for the same species, these will be represented using different x*
taxon IDs in the database.
You have two options to link these back to standard NCBI identifiers.
First, you can parse the database taxonomy (e.g. in /databases/miniSeq+H/taxonomy
- the database taxonomy contains the x*
entries and from these you can go back (i.e., one level up) to standard NCBI taxon IDs.
Second, if you want the specific NCBI genome sequence corresponding to an x*
entry, you can query the DB.fa
file, like to:
grep 'x1088' databases/miniSeq+H/DB.fa
(querying for x1088
in the default miniSeq+H
database - the returned lines will contain the NCBI identifiers of the corresponding sequences).
(1) Yes, this is supported - see Step 2 here: https://github.com/DiltheyLab/MetaMaps#databases.
However, one limitation of the existing workflow is that it currently only supported RefSeq-style input directories.
If this does not work for you and if you could give me a few more details about your specific use case (input formats etc.), we could either extend the existing workflow or set up a new preprocessing script that does what you need!
Hi @wangqi0000,
Let me answer your questions in reverse order, because they are actually related :-)
(2) The
x*
taxon IDs are strain-specific "pseudo-taxon" IDs - i.e. if you have multiple reference genomes for the same species, these will be represented using differentx*
taxon IDs in the database.You have two options to link these back to standard NCBI identifiers.
First, you can parse the database taxonomy (e.g. in
/databases/miniSeq+H/taxonomy
- the database taxonomy contains thex*
entries and from these you can go back (i.e., one level up) to standard NCBI taxon IDs.Second, if you want the specific NCBI genome sequence corresponding to an
x*
entry, you can query theDB.fa
file, like to:grep 'x1088' databases/miniSeq+H/DB.fa
(querying forx1088
in the defaultminiSeq+H
database - the returned lines will contain the NCBI identifiers of the corresponding sequences).(1) Yes, this is supported - see Step 2 here: https://github.com/DiltheyLab/MetaMaps#databases.
However, one limitation of the existing workflow is that it currently only supported RefSeq-style input directories.
If this does not work for you and if you could give me a few more details about your specific use case (input formats etc.), we could either extend the existing workflow or set up a new preprocessing script that does what you need!
Thank you so much! This will help me a lot.
Hello,
I am having two issues and need some help in resolving the issue.
(1) I have some strain genome sequences of single species (such as Bacteroides_fragilis.fa), is it possible to construct a database for these sequences? If so, how should it be constructed? (2) When I run metamaps on my dataset, I obtain a file named "classification_results_B1.EM.WIMP" which showed below: Can I know exactly which genome sequence x598 (highlighted in red) represents in NCBI?
Thanks!!
Best, Qi