How to get MetaCOXI_Seqs.fasta in DNAFASTAFormat

kcbeng2022 commented 2 years ago

Hi @bachob5 @SantamariaMonica,

MetaCOXI is a very nice and useful database, thank you for developing it. I would like to use the MetaCOXI_Seqs.fasta and MetaCOXI_Taxonomy_Metadata.tsv as input files in QIIME2. I tried this;

wget https://zenodo.org/record/5914195/files/MetaCOXI_Seqs.tar.gz
wget https://zenodo.org/record/5914001/files/MetaCOXI_Taxonomy_Metadata.tar.gz

tar -xf MetaCOXI_Seqs.tar.gz
tar -xf MetaCOXI_Taxonomy_Metadata.tar.gz

qiime tools import --type 'FeatureData[Sequence]' --input-path MetaCOXI_Seqs.fasta --output-path MetaCOXI_Seq_database.qza

but got this error message:

There was a problem importing MetaCOXI_Seqs.fasta:

MetaCOXI_Seqs.fasta is not a(n) DNAFASTAFormat file:

Invalid character 'I' at position 1 on line 1425456 (does not match IUPAC characters for this sequence type). Allowed characters are ACGTRYKMSWBDHVN.

Do you have any ideas on how to resolve this?

Thank you!

bachob5 commented 2 years ago

Hi @kcbeng2022, Thanks for your feedback, it is much appreciated! I just downloaded MetaCOXI_Seqs.tar.gz and extracted it as you described. I could parse it with biopython without any error. I also tried to grep for the '|' character and got zero matches. I am not sure, but I assume Qiime2 may have modified the file through some internal operation. I haven't tried running Qiime2 yet, but if your error persists I can do so in the next couple of days!

Let me know,

Cheers!!
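
For readers who want to reproduce the check discussed above, here is a minimal sketch that scans a FASTA file for characters outside the set listed in the Qiime2 error message. It is not the maintainers' method, and the function name `find_invalid_characters` is hypothetical. It assumes a plain-text FASTA file and treats every character not in `ACGTRYKMSWBDHVN` (including lowercase letters) as invalid. Line numbers count every line of the file, headers included.

```python
import io

# Characters listed as allowed in the Qiime2 error message quoted in this thread.
ALLOWED = set("ACGTRYKMSWBDHVN")


def find_invalid_characters(handle, allowed=ALLOWED, limit=10):
    """Return up to `limit` (line_number, position, character) tuples for
    sequence lines that contain characters outside `allowed`."""
    problems = []
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA header lines
        for position, char in enumerate(line, start=1):
            if char not in allowed:
                problems.append((line_number, position, char))
                if len(problems) >= limit:
                    return problems
    return problems


# Tiny in-memory example; for a real file use open("MetaCOXI_Seqs.fasta").
example = io.StringIO(">seq1\nACGTN\n>seq2\nIACGT\n")
for line_number, position, char in find_invalid_characters(example):
    print(f"line {line_number}, position {position}: {char!r}")
# prints: line 4, position 1: 'I'
```

Reporting only the first few hits keeps the output short on a large file, and the reported line number can be compared directly with the one in the Qiime2 message.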

kcbeng2022 commented 2 years ago

Hi @bachob5, Thanks for your response! I suggest you try importing the fasta file into Qiime2 and see if you get a similar error. This is a fantastic database and many Qiime2 users will be very interested in using it, provided it is compatible. Is it also possible to have the taxonomy file in a Qiime2-compatible format? I have an eDNA dataset of marine metazoans and I am very eager to use your dataset for taxonomic assignment of my ASVs. Do you know which bioinformatic pipeline is compatible with your dataset? I read your paper but did not find any case study on how you applied the database.

Many thanks!

bachob5 commented 2 years ago

Hi @kcbeng2022, Thanks for the feedback! The sequence FASTA file has now been fixed. I uploaded the new version with a new link and imported it successfully into Qiime2! As for the taxonomy, I have provided the metadata file, including taxonomy, in 'tsv' format so that it is easy to parse. However, if you have difficulty extracting the taxonomy path from it, let me know and I can send you a python script that does the job. In that case, please tell me what each column should contain so I can quickly write the script.

We haven't tried MetaCOXI on a specific use case yet (it is in our future plans) because the idea behind it was to provide a curated collection in standard formats. In this way, different tools such as Usearch, Blast, MOTHUR or even Qiime2 can be used on it immediately or with a few processing commands.

Let me know if I can help, Cheers!

kcbeng2022 commented 2 years ago

Hi @bachob5 Thanks for fixing the sequence file. It works perfectly in Qiime2!

Here is an example for a Qiime2 sequence file; https://data.qiime2.org/2022.2/tutorials/training-feature-classifiers/85_otus.fasta and its corresponding taxonomy file; https://data.qiime2.org/2022.2/tutorials/training-feature-classifiers/85_otu_taxonomy.txt

The taxonomy file has two columns (sequence name and taxonomy) and no header. It can be in tsv or txt.
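
As an illustration of the two-column, headerless layout described here, the following minimal sketch converts a tab-separated metadata table into that layout. It is not the script the maintainers later provided. The function name `metadata_to_qiime_taxonomy`, the column names (`seq_id`, `phylum`, `class`, `species`) and the `"; "` separator between ranks are all assumptions made for the example; the real MetaCOXI metadata columns may differ.

```python
import csv
import io


def metadata_to_qiime_taxonomy(src, dst, id_column, rank_columns, sep="; "):
    """Write one 'sequence name <TAB> taxonomy' line per metadata row.

    `src` is a tab-separated table with a header row; `dst` receives a
    two-column, headerless, tab-separated file.
    """
    reader = csv.DictReader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    for row in reader:
        # Join the chosen rank columns into a single taxonomy path.
        lineage = sep.join(row[column] for column in rank_columns)
        writer.writerow([row[id_column], lineage])


# Hypothetical input: column names and the record are illustrative only.
metadata = io.StringIO(
    "seq_id\tphylum\tclass\tspecies\n"
    "S1\tChordata\tActinopteri\tGadus morhua\n"
)
taxonomy = io.StringIO()
metadata_to_qiime_taxonomy(
    metadata, taxonomy, id_column="seq_id",
    rank_columns=["phylum", "class", "species"],
)
print(taxonomy.getvalue(), end="")
# prints: S1<TAB>Chordata; Actinopteri; Gadus morhua
```

For real files, pass handles from `open(..., newline="")` instead of the in-memory buffers, and adjust `id_column` and `rank_columns` to the actual header of the metadata file.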

Thanks for your help with the python script. I don't have experience with parsing. Perhaps we could work together at some point to validate your database using real-world metabarcoding samples. I have several COI metabarcoding datasets.

Cheers!

bachob5 commented 2 years ago

Hi @kcbeng2022,

Thanks for the example files. I just uploaded a python script (link: scripts/formatTaxonomy4Qiime.py) that should do the job by producing a taxonomy file in the Qiime2 format. The documentation on how to run it is available within the script!

Yes sure we can collaborate to test some use cases! For now, let me know if the script works for your case.

Cheers!!
