epi2me-labs / wf-metagenomics

Metagenomic classification of long-read sequencing data

Minimap2 custom database #80

Closed EsbergA closed 2 months ago

EsbergA commented 6 months ago

Hi all, We are working with Minimap2 within the EPI2ME Labs platform and are trying to build a custom database from the 1,500 full-length 16S rRNA sequences in the HOMD database. As I understand it, this Minimap2 database should contain two files: (i) a RefSeq file with all sequences and (ii) a taxonomy file with all taxonomic levels for each RefSeq ID. I have tried many different formats, but every run stops with an error. If anyone has an example or can explain how a custom Minimap2 database should be configured, it would be greatly appreciated. Does the custom database need to link to NCBI taxids and the taxdump? I do not have these IDs for my custom sequences. AE

nggvs commented 6 months ago

Hi @EsbergA, could you try again with the latest version (2.9.0)? There was a bug that prevented the use of large databases. You need the reference and the ref2taxid file. If your ref2taxid uses taxids other than the NCBI ones, you also need to provide your custom taxonomy database: https://labs.epi2me.io/how-to-meta-offline/#the-structure-and-composition-of-the-databases If you are still getting an error, would you mind opening an issue so I can take a look?

Thank you very much for using the workflow!

EsbergA commented 6 months ago

Hi nggvs, I'm still receiving errors; now I also tried wf-16S with the minimap2 function, but I got the same results.

I am using the EPI2me-Lab platform, and my runs always terminate with an error similar to this: "Error: The reference 712237 is not found in your ref2taxid file. Please make sure that the ref2taxid matches the reference"

As I understand Minimap2, it requires a FASTA file and a ref2taxid:

My FASTA file (containing approx. 800 full-length 16S gene sequences) has this structure:

712237 GAGTTTGATCCTGGCTCAGAGCGAACGCTGGCGGCAG..........

My matching ref2taxid file is built up like this:

712237 k_Bacteria;p_Proteobacteria;c_Alphaproteobacteria;o_Caulobacterales;f_Caulobacteraceae;g_Caulobacter;s_Caulobacter_sp._HMT_002

As far as I can see, 712237 is the correct NCBI taxid. I have tried many different combinations but never get it to run. Help is greatly appreciated. EsbergA

nggvs commented 6 months ago

Hi @EsbergA, Yes, the minimap2 approach works with a custom database, which requires a FASTA file and a ref2taxid file. The ref2taxid file maps each reference ID in the FASTA to a taxid. For example, for this FASTA record:

>ref1
AAAA....

the corresponding entry in the ref2taxid file should look like this (the refID first, then the taxid):
ref1 712237

You can download the default ref2taxid and the corresponding FASTA if you want to see an example.

You don't need to provide the full lineage string if you are using NCBI taxids, but if you are using different ones, then you need to provide a taxonomy db (which links the taxids to the full lineage name).
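To make the layout described above concrete, here is a minimal Python sketch that writes a ref2taxid file from a FASTA. It is not part of the workflow. It assumes the file is tab-separated (the default example has a .tsv extension) and that the reference ID is the header text after '>' up to the first whitespace. The function names and the `taxid_by_ref` mapping are hypothetical.

```python
import csv
import io


def fasta_ids(handle):
    """Yield each record ID: the header text after '>' up to the first whitespace."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def write_ref2taxid(fasta_handle, taxid_by_ref, out_handle):
    """Write one 'refID<TAB>taxid' row per FASTA record.

    Returns the reference IDs that have no taxid in taxid_by_ref,
    so they can be fixed before running the workflow.
    """
    # Assumption: the workflow reads a two-column, tab-separated file.
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    missing = []
    for ref_id in fasta_ids(fasta_handle):
        taxid = taxid_by_ref.get(ref_id)
        if taxid is None:
            missing.append(ref_id)
        else:
            writer.writerow([ref_id, taxid])
    return missing


# Hypothetical two-record reference; only ref1 has a known taxid.
fasta = io.StringIO(">ref1\nAAAA\n>ref2 some description\nCCCC\n")
out = io.StringIO()
missing = write_ref2taxid(fasta, {"ref1": 712237}, out)
print(out.getvalue(), end="")
print("references without a taxid:", missing)
```

With real files, the `io.StringIO` objects would be replaced by `open(...)` handles for the FASTA and the output TSV.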

Please let me know if you can fix the problem! Thank you very much

EsbergA commented 6 months ago

Thanks, Natalia, for your response! I have changed the files accordingly (see attached files), but the run still stops with the following error: "Error: The number of elements of the "RefSeq.tsv" doesn't match the number of elements in the "Fasta.fna"." Can you spot any errors in these files?

EsbergA commented 6 months ago

Dear Natalia, I have now tried the NCBI files you kindly provided and ran them with the Nextflow wf-16S pipeline, and I get this error:

nggvs commented 6 months ago

Hi @EsbergA ,

I don't see the attached files you mentioned. Could you also copy the command you are running so I can see the options or, if you are running it through the app, the options you are supplying? The files I mentioned earlier are the default ones (they should be the same as those used with --database_set ncbi_16s_18s when no external database is provided).

Thank you very much and apologies for the delay

eveotj commented 5 months ago

Hello, I am having the same issue when analysing my data (with a custom database):

  1. when I use the FASTA format for my reference, it says that the number of elements doesn't match the taxid file (I have confirmed that it does match)
  2. when I use the mmi format for my reference, I get the error that a reference is not found in my taxid file, although both files contain it

Did you manage to find a solution to this problem? Thank you in advance for the help

nggvs commented 4 months ago

Hi @eveotj, Thank you for using the workflow and apologies for the late answer. Could you share the database, or the error you are getting for that gene?

Janca-Pieters95 commented 3 months ago

Hi @nggvs

I am experiencing the same problem. I am using Minimap2 with a custom database. (all versions are up to date)

FASTA file input:

>81152
CTGNCGGCGTGCCTAACACATNCAAGTCGAGCGGTGCTACGGAGGTCTTCGGACTGAAGTAGCATAGCGGCGGACGGGTGAGTAATACACAGGAACGTGCCCCTTGGAGGCGGATAGCTGTGGGAAACTGCAGGTAATCCGCCGTAAGCTCGGGAGAGGAAAGCCGGAAGGCGCCGAGGGAGCGGCCTGTGGCCCATCAGGTAGTTGGTAGGGTAAGAGCCTACCAAGCCGACGACGGGTAGCCGGTCTGAGAGGATGGACGGCCACAAGGGCACTGAGAC

TaxId file (TSV format):

81152 Bacteria;Thermotogae;Thermotogae_c;Thermotogales;Fervidobacteriaceae;A61579_g;A61579_s

Error: The reference 123073 is not found in your ref2taxid file. Please make sure that the ref2taxid matches the reference.

However it is in both files provided. I have tried all solutions suggested above. Any additional suggestions would be highly appreciated.

Thanks

nggvs commented 3 months ago

Hi @Janca-Pieters95 ,

Thank you for using the workflow! The workflow expects a different file format from the one you are using. You can see an example here. You can also check our blog post for more information. Let me know if that works for you! If it does, please close the issue. Thanks
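As an illustration of the format difference, the following Python sketch flags ref2taxid rows that do not look like "refID, then taxid". It is only a sketch and not the check the workflow itself performs. It assumes tab-separated columns and an integer taxid in the second column. The function name `lint_ref2taxid` and the sample rows are hypothetical.

```python
def lint_ref2taxid(lines):
    """Return (line_number, problem) tuples for rows that are not 'refID<TAB>integer taxid'."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        row = raw.rstrip("\r\n")
        if not row:
            continue  # ignore blank lines
        fields = row.split("\t")
        if len(fields) != 2:
            problems.append(
                (number, f"expected 2 tab-separated columns, found {len(fields)}")
            )
        elif not fields[1].strip().isdigit():
            # A lineage string in the taxid column ends up here.
            problems.append(
                (number, f"taxid column is not an integer: {fields[1][:30]!r}")
            )
    return problems


# Hypothetical rows: one valid, one with a lineage instead of a taxid,
# one separated by a space instead of a tab.
rows = [
    "ref1\t712237\n",
    "81152\tBacteria;Thermotogae;Thermotogae_c\n",
    "ref3 712237\n",
]
for number, problem in lint_ref2taxid(rows):
    print(f"line {number}: {problem}")
```

Passing an open file handle instead of the `rows` list works the same way, since the function only iterates over lines.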

nggvs commented 2 months ago

Hi @EsbergA , @Janca-Pieters95 @eveotj !

Thank you all for using the workflow! As I have not received more feedback on this issue, I'll close it, but please open a new one if you're still having issues! Thank you very much!

plycrsk commented 2 months ago

I'm having the same issues.

reference file example:

kraken:taxid|1965238|NC_034217.1 Pityohyphantes rubrofasciatus iflavirus isolate UW1, complete genome
GTTATGACATTAGCTATTTAAACTCACTGTTTACATGTTTACTTAGTTATTCTATTATAA
GAGATTTATCCACTTTCCTTTTCAATTTTGGATAGAAATTTATATAATTTCCCTATTTTA
AAATAATCTCAAGGTTTTAAACCTCTTTAATTAGGACTGAAATGATTTTATTATGAAAAG
TGTTTACACGCTTATTAATTTTAAATATTGTTTCTAAGAATTTAGATAATGTACCCCTAT

ref2taxid:

kraken:taxid|1965238|NC_034217.1 1965238
kraken:taxid|1690428|NC_042052.1 1690428
kraken:taxid|2660689|NC_074749.1 2660689
kraken:taxid|1476886|NC_024215.1 1476886
kraken:taxid|2560315|NC_025361.1 2560315
kraken:taxid|1048854|NC_033830.1 1048854
kraken:taxid|1048854|NC_033847.1 1048854
kraken:taxid|1048854|NC_033831.1 1048854
kraken:taxid|1922553|NC_032592.1 1922553
kraken:taxid|1923593|NC_033137.1 1923593

The files have the same number of entries and the IDs are identical:

wc -l viral_prelim_map.tsv
18640 viral_prelim_map.tsv

grep '>' viral_library.fna | wc -l
18640

I get the same error the other users posted above:

Error: The reference kraken:taxid|2786405|NC_074583.1 is not found in your ref2taxid file. Please make sure that the ref2taxid matches the reference. If your input are bam files, make sure that the ref2taxid matches the reference used for the mapping step.

This ID is in ref2taxid and matches.
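Equal counts from wc and grep show that both files have the same number of entries, not that the IDs are the same strings. A more direct check is to compare the two ID sets, as in this Python sketch. It assumes the reference ID is the FASTA header text up to the first whitespace (which is how minimap2 names sequences) and that the ref2taxid file is tab-separated. The function names and the sample data are hypothetical, and the sketch does not reproduce the workflow's own validation.

```python
import io


def fasta_header_ids(handle):
    """Collect record IDs: header text after '>' up to the first whitespace."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def ref2taxid_ids(handle):
    """Collect the first tab-separated column of each non-empty row, unstripped."""
    ids = set()
    for line in handle:
        row = line.rstrip("\r\n")
        if row:
            # No strip() here, so stray spaces in the ID column stay visible.
            ids.add(row.split("\t")[0])
    return ids


def compare_ids(fasta_handle, ref2taxid_handle):
    """Return (IDs only in the FASTA, IDs only in the ref2taxid file)."""
    in_fasta = fasta_header_ids(fasta_handle)
    in_table = ref2taxid_ids(ref2taxid_handle)
    return sorted(in_fasta - in_table), sorted(in_table - in_fasta)


# Hypothetical data: both files have two entries, but one ID in the
# table carries a trailing space.
fasta = io.StringIO(">idA some description\nACGT\n>idB\nACGT\n")
table = io.StringIO("idA\t1\nidB \t2\n")
only_fasta, only_table = compare_ids(fasta, table)
print("only in FASTA:", only_fasta)
print("only in ref2taxid:", only_table)
```

In this made-up case the counts match, yet one ID differs by a trailing space, which the list output makes visible through its quotes. Both lists being empty would mean the ID sets are identical under the stated assumptions.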