Question: generate marker genes from mitochondrion refseq

brutal588 commented 1 month ago

Hello,

Thank you for creating this great tool. I have successfully run read2tree using marker genes downloaded from the OMA browser together with my own data, and it works really well. However, I ran into a problem when trying to generate marker genes with OMA standalone, using the NCBI RefSeq mitochondrion genomes as reference. I followed the wiki on obtaining marker genes for a viral dataset, but for the mitochondrial RefSeq there is no CDS .fna file corresponding to the .faa protein sequences, only the whole mitochondrial genome DNA sequence for each species. I have thought of extracting the CDS sequences from the whole-genome DNA file using the start and end positions provided in the .gbff file in that folder, but I am having trouble handling the file names and other information. Is there a better solution you could suggest for this task?

I'm using mitochondrion refseq from this ncbi ftp: https://ftp.ncbi.nlm.nih.gov/genomes/refseq/mitochondrion/

Best regards, Yang

alpae commented 1 month ago

Hi Yang,

Within read2tree there is little we can do to support this out of the box at the moment. However, you can usually extract the CDS and protein sequences from the gbff file yourself quite easily:

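import gzip
from Bio import SeqIO

with gzip.open('data.gbff.gz', 'rt') as fh:
    for rec in SeqIO.parse(fh, 'genbank'):
        for f in rec.features:
            # keep only CDS features that carry an annotated translation
            if f.type != 'CDS' or 'translation' not in f.qualifiers:
                continue
            prot = f.qualifiers['translation'][0]  # protein sequence as a string
            nuc = f.extract(rec.seq)  # CDS nucleotides, strand-aware

            # write sequences to files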
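
The "write sequences to files" step in the loop above could be filled in along the following lines. This is only a minimal sketch, not something read2tree or OMA standalone provides: the function names (fasta_entry, cds_identifier, export_cds) and the ID scheme (the protein_id qualifier when present, otherwise record ID plus a running index) are assumptions made for illustration. The idea is simply to give each protein and its CDS the same identifier so the two files can be matched later.

```python
import gzip
import textwrap


def fasta_entry(seq_id, seq, width=60):
    """Return one FASTA record, wrapping the sequence at `width` characters."""
    body = "\n".join(textwrap.wrap(str(seq), width))
    return f">{seq_id}\n{body}\n"


def cds_identifier(record_id, qualifiers, index):
    """Pick an ID shared by a protein and its CDS (hypothetical naming scheme)."""
    protein_ids = qualifiers.get("protein_id")
    return protein_ids[0] if protein_ids else f"{record_id}_cds{index}"


def export_cds(gbff_path, faa_path, fna_path):
    """Write proteins and matching CDS nucleotides from a gzipped GenBank file."""
    # Biopython is imported here so the two helpers above need only the stdlib.
    from Bio import SeqIO

    with gzip.open(gbff_path, "rt") as fh, \
            open(faa_path, "w") as faa, open(fna_path, "w") as fna:
        for rec in SeqIO.parse(fh, "genbank"):
            cds = [f for f in rec.features if f.type == "CDS"]
            for index, feat in enumerate(cds, start=1):
                if "translation" not in feat.qualifiers:
                    continue  # e.g. pseudogenes without an annotated protein
                seq_id = cds_identifier(rec.id, feat.qualifiers, index)
                faa.write(fasta_entry(seq_id, feat.qualifiers["translation"][0]))
                fna.write(fasta_entry(seq_id, feat.extract(rec.seq)))
```

It could then be called as export_cds('data.gbff.gz', 'proteins.faa', 'cds.fna'), where the two output names are placeholders. Since the RefSeq file covers many species, the output would probably still need to be split per species before running OMA standalone.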

However, the folder you mentioned does not seem to contain the genomic sequence itself.

I hope this could nevertheless be useful.

brutal588 commented 1 month ago

Hi Adrian,

Thank you for the quick response.

I have successfully extracted the CDS and protein sequences from the .gbff file, along with the whole mitochondrial genome file, as you suggested. Since this RefSeq set contains mitochondrial sequences for many species, I'll extract those I'm interested in and try to run OMA standalone and read2tree.

If it is not too much trouble, I still have two questions:

1) When running OMA standalone, is each amino acid sequence considered a candidate for the OGs? That would mean the reference sequences must be well annotated, because multiple OGs cannot be found from a single sequence (such as a whole mitochondrial genome).

2) For the raw reads given as input to read2tree, should we first run a quality control step on the raw FASTQ file, especially when using ONT reads?

Best regards.

DessimozLab / read2tree

Question: generate marker genes from mitochondrion refseq #67