EI-CoreBioinformatics / minos

The labyrinth king judges your gene models.
GNU Lesser General Public License v3.0

seqkit translate errors if the CDS file contains sequences below 3bp #60

Closed (swarbred closed 1 year ago)

swarbred commented 2 years ago

https://github.com/EI-CoreBioinformatics/minos/blob/0611d65fdd2c7e3a01b939282dd597dfb6a447f6/minos/zzz/minos_run.smk#L302

ORFs in the GFF that are shorter than 3bp cause seqkit translate to error. I think we can avoid this by simply filtering the file to exclude very small CDS sequences, i.e. adding a seqkit seq step with a minimum CDS size of 48:

seqkit seq -m 48 CDS.fa | seqkit translate --threads 1 --line-width 70 -T 1

The seqkit seq step would be added to minos_run.run_config.yaml so that -m can be changed, but setting it to 48 seems reasonable.
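To illustrate what the proposed length filter does, here is a minimal Python sketch of the same idea: drop any CDS record shorter than a minimum length before it reaches translation. This is not minos or seqkit code. The function names `read_fasta` and `filter_min_length` are hypothetical, and the sketch assumes that `-m` keeps sequences whose length is greater than or equal to the threshold.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_min_length(records: Iterable[Record], min_len: int = 48) -> Iterator[Record]:
    """Keep only records at least min_len bases long (assumed -m semantics)."""
    for header, seq in records:
        if len(seq) >= min_len:
            yield header, seq


if __name__ == "__main__":
    # Hypothetical input: a 2bp CDS, a 48bp CDS and a 47bp CDS.
    example = [">tiny", "AT", ">ok", "ATG" * 16, ">short", "A" * 47]
    for header, seq in filter_min_length(read_fasta(example)):
        print(f">{header}\n{seq}")
```

Keeping the threshold as a parameter mirrors the idea of exposing -m in the run config rather than hard-coding it.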

I'm not aware of this causing any issues downstream, i.e. there will be models with a CDS but no protein sequence, but this is all pre-pick, and running diamond with a protein sequence of 1 amino acid won't be useful anyway.

gemygk commented 2 years ago

Thanks @swarbred

It is strange that we get an error with seqkit translate in such a case.

I have installed the fix, minos-1.9.0-dev_6b9055b, on our cluster.