Closed GallVp closed 5 months ago
Careful with duplicates at contig level, e.g.
Mitochondrial and plastid sequences are 'contaminants' only in some contexts; in others they are part of the whole genome (the nuclear genome and the organellar genomes combined make up the whole genome). We need to ensure that mitochondrial and chloroplast genomes are not themselves marked as contaminants when they are not.
Plant nuclear genomes contain chloroplast gene insertions, so we need to make sure these regions are not marked as contamination.
Hi @CeciliaDeng
Does the above comment from Ross answer your question? Is there a tool you have in mind for duplicate detection?
Hi @GallVp and @rosscrowhurst, we have encountered duplicated sequences before in our NCBI submissions, particularly in de novo assemblies from short reads. Yes, the duplicated seqs are usually at contig level, with exactly the same sequence but different SeqIDs. I ran 'ml seqkit; seqkit rmdup -s -o $checkedFasta $inputFasta' to remove such items.
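For illustration, here is a minimal Python sketch of the same idea: drop records whose sequence is an exact copy of an earlier one, regardless of SeqID. It is not how seqkit is implemented, and it only catches exact string matches (no case folding, no reverse complements). The function names `parse_fasta` and `remove_duplicate_sequences` are hypothetical and are not part of the pipeline.

```python
from typing import Dict, Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def remove_duplicate_sequences(
    records: List[Tuple[str, str]],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Keep the first record for each distinct sequence.

    Returns (kept, dropped), where each dropped entry is
    (dropped_header, header_of_first_occurrence).
    Assumption: only exact, case-sensitive matches count as duplicates.
    """
    first_seen: Dict[str, str] = {}
    kept: List[Tuple[str, str]] = []
    dropped: List[Tuple[str, str]] = []
    for header, seq in records:
        if seq in first_seen:
            dropped.append((header, first_seen[seq]))
        else:
            first_seen[seq] = header
            kept.append((header, seq))
    return kept, dropped
```

Reporting the dropped headers alongside the record they duplicate makes it easier to review what was removed before submission.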
For genomes we downloaded from the public domain, there are sometimes duplicated SeqIDs, and 'samtools faidx $inputFasta' will complain and exit. In that case we can use 'seqkit rmdup -n -o $outFile $inputFasta' to remove seqs with the same ID. However, sequences with the same SeqID can still differ from each other, and in that case I usually append '.1', '.2' and so on to the IDs of the sequences that share an ID.
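The renaming step could be sketched roughly as below. This is only an assumption about how the suffixing might be done, not an existing script: `make_ids_unique` is a hypothetical helper that takes (SeqID, sequence) pairs and appends '.1', '.2', ... to every ID that occurs more than once, skipping any suffixed name that is already taken by another record.

```python
from collections import Counter
from typing import List, Tuple


def make_ids_unique(records: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Append '.1', '.2', ... to SeqIDs that occur more than once.

    IDs that are already unique are left unchanged. Sequences are never
    modified or removed, so records with the same ID but different
    sequences are all retained.
    """
    totals = Counter(seq_id for seq_id, _ in records)
    # IDs that are unique in the input are reserved up front so that a
    # generated name such as 'chr1.1' cannot collide with an existing one.
    used = {seq_id for seq_id, count in totals.items() if count == 1}
    next_suffix: Counter = Counter()
    renamed: List[Tuple[str, str]] = []
    for seq_id, seq in records:
        if totals[seq_id] == 1:
            renamed.append((seq_id, seq))
            continue
        # Try increasing suffixes until an unused name is found.
        while True:
            next_suffix[seq_id] += 1
            candidate = f"{seq_id}.{next_suffix[seq_id]}"
            if candidate not in used:
                break
        used.add(candidate)
        renamed.append((candidate, seq))
    return renamed
```

For example, two records named `chr1` with different sequences would come out as `chr1.1` and `chr1.2`, while a unique `chr2` is left as is.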
Thank you @CeciliaDeng
This is very useful information. I will add the following to fasta validation:
We are using py_fasta_validator to validate fasta files. It does detect sequence ID duplication. Please see:
From @CeciliaDeng
BTW, we haven't checked for duplicate sequences in the assembly, have we? I can't remember if the QC pipeline checks for mitochondrial/plastid/ribosomal RNA contamination. If not, we may list these as 'todo for future release'?