
Plant-Food-Research-Open / assemblyqc

A Nextflow pipeline for evaluating assembly quality

https://plant-food-research-open.github.io/assemblyqc/

MIT License


Duplication checks #64

Closed by GallVp 5 months ago

GallVp commented 1 year ago

From @CeciliaDeng

BTW, we haven't checked for duplicate sequences in the assembly, have we? I can't remember whether the QC pipeline checks for mitochondrial/plastid/ribosomal RNA contamination. If not, we could list these as 'todo for a future release'?

rosscrowhurst commented 1 year ago

Careful with duplicates at contig level, e.g.:

- Exact duplicates (base-for-base identical, same length) should not occur.
- Imperfect duplicates (not base-for-base identical, possibly of different lengths) might be allelic variation.
- Whole-genome duplications need to be differentiated from duplicates; some input contigs may be "similar" but not identical.
- Others ...

So we just need to be careful about what we mean by "checking duplicate sequences".

Mitochondrial and plastid sequences are 'contaminants' only in some contexts; in others they are part of the whole genome (the nuclear genome and the organellar genomes together make up the whole genome). We need to ensure that mitochondrial and chloroplast genomes are not themselves marked as contaminants when they are not.

Plant nuclear genomes also contain chloroplast gene insertions, so we need to make sure these regions are not marked as contamination.

GallVp commented 6 months ago

Hi @CeciliaDeng

Does the above comment from Ross answer your question? Is there a tool you have in mind for duplicate detection?

CeciliaDeng commented 5 months ago

Hi @GallVp and @rosscrowhurst, we have encountered duplicated sequences before in our NCBI submission, particularly in de novo assemblies from short reads. Yes, the duplicated sequences are usually at contig level, with exactly the same sequence but different SeqIDs. I ran 'ml seqkit; seqkit rmdup -s -o $checkedFasta $inputFasta' to remove such items.
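
For illustration, the idea of dropping records whose sequence exactly matches an earlier record can be sketched in Python as follows. This is not seqkit's implementation: the function name `drop_duplicate_sequences` and the `ignore_case` parameter are hypothetical, only the forward strand is compared, and records are assumed to be already parsed into (SeqID, sequence) pairs.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str]  # (SeqID, sequence)


def drop_duplicate_sequences(
    records: Iterable[Record], ignore_case: bool = False
) -> Tuple[List[Record], List[Tuple[str, str]]]:
    """Keep the first record of each distinct sequence.

    Returns (kept_records, dropped), where dropped lists
    (dropped SeqID, SeqID of the earlier identical record).
    """
    first_id_for_seq: Dict[str, str] = {}
    kept: List[Record] = []
    dropped: List[Tuple[str, str]] = []
    for seq_id, seq in records:
        # Assumption: whether soft-masked (lowercase) bases count as
        # identical is a choice, so it is exposed as a parameter.
        key = seq.upper() if ignore_case else seq
        if key in first_id_for_seq:
            dropped.append((seq_id, first_id_for_seq[key]))
        else:
            first_id_for_seq[key] = seq_id
            kept.append((seq_id, seq))
    return kept, dropped


example = [
    ("contig_1", "ACGT"),
    ("contig_2", "ACGT"),   # same sequence, different SeqID
    ("contig_3", "acgt"),   # differs only in case
    ("contig_4", "ACGTA"),  # longer, so not an exact duplicate
]
kept, dropped = drop_duplicate_sequences(example)
```

Reporting which SeqID each dropped record matched makes it easier to review the removals before submitting the cleaned file.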

CeciliaDeng commented 5 months ago

For genomes we downloaded from the public domain, there are sometimes duplicated SeqIDs, and 'samtools faidx $inputFasta' will complain and exit. In that case we can use 'seqkit rmdup -n -o $outFile $inputFasta' to remove sequences with the same ID. However, sequences that share a SeqID can still differ from each other; in that case I usually append '.1', '.2' and so on to the IDs of the sequences that share an ID.
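
The renaming step could look roughly like the Python sketch below. It assumes that every record sharing an ID gets a suffix (the first becomes '.1', the second '.2', and so on) and that a suffix is skipped if the resulting ID already exists in the file; both choices, and the function name `uniquify_ids`, are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def uniquify_ids(ids: List[str]) -> List[str]:
    """Append .1, .2, ... to IDs that occur more than once."""
    counts = Counter(ids)
    # IDs that are already unique are kept unchanged and reserved,
    # so a generated ID never collides with them.
    used = {seq_id for seq_id in ids if counts[seq_id] == 1}
    next_suffix: Dict[str, int] = {}
    result: List[str] = []
    for seq_id in ids:
        if counts[seq_id] == 1:
            result.append(seq_id)
            continue
        n = next_suffix.get(seq_id, 1)
        candidate = f"{seq_id}.{n}"
        while candidate in used:  # skip suffixes that are already taken
            n += 1
            candidate = f"{seq_id}.{n}"
        used.add(candidate)
        next_suffix[seq_id] = n + 1
        result.append(candidate)
    return result


renamed = uniquify_ids(["chr1", "chr2", "chr1", "chr1"])
```

Because the order of the input is preserved, the new IDs can be written back onto the records in the same pass that reads the FASTA file.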

GallVp commented 5 months ago

Thank you @CeciliaDeng

This is very useful information. I will add the following to FASTA validation:

- All sequence IDs must be unique.
- All sequences must be unique. Here a sequence means the entire sequence of a record, not a part of it; two sequences count as duplicates only if they match with 100% identity and 100% coverage.
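
A minimal Python sketch of these two rules is shown below. It is not the pipeline's code: `read_fasta` and `check_uniqueness` are hypothetical helpers, the SeqID is assumed to be the header text up to the first whitespace, and "100% identity and coverage" is interpreted as exact, case-sensitive string equality of whole sequences.

```python
import io
from collections import defaultdict
from typing import Dict, Iterator, List, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (SeqID, sequence) pairs from FASTA text."""
    seq_id = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def check_uniqueness(handle: TextIO) -> List[str]:
    """Return one message per violated rule instance (empty if valid)."""
    id_counts: Dict[str, int] = defaultdict(int)
    ids_by_sequence: Dict[str, List[str]] = defaultdict(list)
    for seq_id, seq in read_fasta(handle):
        id_counts[seq_id] += 1
        ids_by_sequence[seq].append(seq_id)
    errors = []
    # Rule 1: all sequence IDs must be unique.
    for seq_id, n in id_counts.items():
        if n > 1:
            errors.append(f"duplicate sequence ID: {seq_id} ({n} records)")
    # Rule 2: no two records may carry the same entire sequence.
    for ids in ids_by_sequence.values():
        if len(ids) > 1:
            errors.append("identical sequences: " + ", ".join(ids))
    return errors


fasta_text = ">a first\nACGT\nAC\n>b\nACGTAC\n>a\nTTTT\n"
errors = check_uniqueness(io.StringIO(fasta_text))
```

Using whole sequences as dictionary keys keeps the sketch short; for large assemblies, keying on a digest of each sequence would reduce memory use at the cost of an extra hashing step.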

GallVp commented 5 months ago

We are using py_fasta_validator to validate FASTA files. It does detect duplicated sequence IDs. Please see:

https://github.com/linsalrob/py_fasta_validator/blob/32d1d2a49da550df41d44bc61be4341cdf104ae4/PyFastaValidator/validate.py#L28