RyanCook94 / inphared

Providing up-to-date phage genome databases, metrics and useful input files for a number of bioinformatic pipelines.
GNU Affero General Public License v3.0

Non viral sequences & genome fragments #11

Closed by snayfach 2 years ago

snayfach commented 2 years ago

I just ran CheckV on the latest inphared dataset and noticed a number of recently deposited contigs that are clearly non-viral based on marker genes. Here are a few examples: CAKLQF020000004 CAKLQF020000005 MW495066 CAKLQH020000006 CAKLQF020000009 CAKLQF020000008. These are 150 kb+ contigs that are littered with bacterial genes and have almost no viral genes.

There are some other sequences that look like short genome fragments: MH319743 MH327486 ACSJ01000018 HM246723 MH319752 MH327485 MH319722 ACSJ01000015 (<10% estimated completeness).

Maybe there are other issues, but these were the first things I spotted. Clearly this is the fault of GenBank submitters, but I thought it was something that could be addressed in the future.

RyanCook94 commented 2 years ago

Hi Stephen,

Thank you for flagging these. I've just added the large ones to the exclusion list, so they won't be in future releases (think that's now ~1800 dodgy genomes to exclude!).

As for the smaller ones, I'll try to run CheckV this coming week and will add anything that's obviously incomplete to the exclusion list.

Yes, unfortunately, this is very much the fault of some GenBank users making questionable submissions...

If you ever spot any more, please let me know and I'll keep adding to the list.

Many thanks!

snayfach commented 2 years ago

Hey Ryan - I only looked at ~6000 contigs, but I can certainly look at the full set and send you a list of contigs to exclude + evidence for exclusion. I don't think you need to do anything, unless you want to build a process to automatically remove newly deposited sequences that fail QC in future releases.
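
For illustration only, an automated QC step along these lines could be sketched roughly as below. This is not inphared's or CheckV's actual process. It assumes a CheckV-style `quality_summary.tsv` with `contig_id`, `viral_genes`, `host_genes` and `completeness` columns; the function name `flag_contigs`, the thresholds (other than the <10% completeness figure mentioned above) and the example rows are hypothetical and would need tuning against real data.

```python
import csv
import io


def flag_contigs(tsv_handle, min_completeness=10.0,
                 max_viral_genes=1, min_host_genes=10):
    """Return {contig_id: reason} for contigs that look like exclusion candidates.

    Thresholds are illustrative assumptions, not recommended values.
    """
    flagged = {}
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    for row in reader:
        contig = row["contig_id"]
        viral = int(row["viral_genes"])
        host = int(row["host_genes"])
        completeness = row["completeness"]

        # Many host (bacterial) genes and almost no viral genes: likely non-viral.
        if host >= min_host_genes and viral <= max_viral_genes:
            flagged[contig] = (
                f"likely non-viral (host_genes={host}, viral_genes={viral})"
            )
        # Very low estimated completeness: likely a short genome fragment.
        # Completeness may be missing ("NA"), in which case it is skipped.
        elif completeness not in ("", "NA") and float(completeness) < min_completeness:
            flagged[contig] = (
                f"likely fragment (completeness={float(completeness):.1f}%)"
            )
    return flagged


# Made-up rows for demonstration; these are not real accessions or results.
EXAMPLE_TSV = "\n".join("\t".join(fields) for fields in [
    ("contig_id", "viral_genes", "host_genes", "completeness"),
    ("contig_A", "0", "120", "NA"),
    ("contig_B", "4", "0", "6.5"),
    ("contig_C", "60", "1", "98.2"),
])

if __name__ == "__main__":
    for contig_id, reason in flag_contigs(io.StringIO(EXAMPLE_TSV)).items():
        print(contig_id, reason, sep="\t")
```

The output pairs each flagged contig with a short reason, which could serve as the "evidence for exclusion" alongside an exclusion list. Anything flagged this way would still merit a manual check before being removed.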

snayfach commented 2 years ago

Closing for now, and will update the thread in the future.