Valid CDS identifier?

aberaslop commented 3 years ago

Hi Cameron,

Congratulations on this software! I find it very nice and would like to run it for my research. I have successfully installed it, but when I try to run it, I get the following error:

[07:26:04] INFO - Starting clinker
[07:26:04] INFO - Parsing GenBank files: ['S1_contig_41.region001.fixed.gbk', 'S2_contig_9.region001.fixed.gbk', 'S3_contig_26.region001.fixed.gbk', 'S4_c00019_NODE_19...region001.fixed.gbk', 'S5_c00001_NODE1...region001.fixed.gbk', 'S6_c00005_NODE5...region001.fixed.gbk', 'S7_scaffold2.region001.fixed.gbk']
Traceback (most recent call last):
  File "/home/aberas2/miniconda3/bin/clinker", line 8, in <module>
    sys.exit(main())
  File "/home/aberas2/miniconda3/lib/python3.7/site-packages/clinker/main.py", line 153, in main
    hide_alignment_headers=args.hide_aln_headers,
  File "/home/aberas2/miniconda3/lib/python3.7/site-packages/clinker/main.py", line 49, in clinker
    clusters = parse_files(paths)
  File "/home/aberas2/miniconda3/lib/python3.7/site-packages/clinker/classes.py", line 49, in parse_files
    return [parse_genbank(path) for path in paths]
  File "/home/aberas2/miniconda3/lib/python3.7/site-packages/clinker/classes.py", line 49, in <listcomp>
    return [parse_genbank(path) for path in paths]
  File "/home/aberas2/miniconda3/lib/python3.7/site-packages/clinker/classes.py", line 32, in parse_genbank
    cluster = Cluster.from_seqrecords(*records, name=path.stem)
  File "/home/aberas2/miniconda3/lib/python3.7/site-packages/clinker/classes.py", line 83, in from_seqrecords
    loci = [Locus.from_seqrecord(record) for record in args]
  File "/home/aberas2/miniconda3/lib/python3.7/site-packages/clinker/classes.py", line 83, in <listcomp>
    loci = [Locus.from_seqrecord(record) for record in args]
  File "/home/aberas2/miniconda3/lib/python3.7/site-packages/clinker/classes.py", line 124, in from_seqrecord
    for feature in record.features
  File "/home/aberas2/miniconda3/lib/python3.7/site-packages/clinker/classes.py", line 125, in <listcomp>
    if feature.type == "CDS"
  File "/home/aberas2/miniconda3/lib/python3.7/site-packages/clinker/classes.py", line 191, in from_seqfeature
    "Could not determine a valid identifier"
ValueError: Could not determine a valid identifier from a CDS SeqFeature in c00019_NODE_19..

I tried removing the file that triggers the error and running the program again, but I get the same error with the next file in line. I think it has to do with the formatting of all my files. I would very much appreciate it if you could advise me on how to fix this problem. I have uploaded one gbk file so that you can get an idea of what they look like.

[S4_c00019_NODE_19...region001.fixed.gbk.txt](https://github.com/gamcil/clinker/files/5501587/S4_c00019_NODE_19.region001.fixed.gbk.txt)

Thank you so much!

L.

hungenlai90 commented 3 years ago

First of all, thanks so much for writing this program; it's really user-friendly for non-bioinformatics experts like me. I managed to get it running in just a few steps. I really like how easy it is to use, and the visualisation is good enough for publication. However, I get the identical error message as above when running the script against 49 RiPP cluster gbk files. When I removed the problematic files, the script ran fine (three of the 49 files were problematic). I'm not sure what causes the issue in these three files. I have tried trimming down the length of the DNA sequence and removing extra features annotated by antiSMASH that weren't important to my gene cluster, but still had no luck getting the script to run the alignment for these files. As the gbk sequences/files have not been publicly released, I unfortunately can't share them here.

hungenlai90 commented 3 years ago

Did more troubleshooting and I found that removing three of the CDS features in one file (doesn't work if you only remove any one or two of them) allowed the script to run without error. I tried removing another set of three CDS features but that didn't work either. It's very weird indeed...

rob2go commented 3 years ago

I also got the same error when using GenBank files generated by SnapGene. If I download directly from NCBI, it works perfectly. The thing is, it has to have the gene and CDS annotations:

     gene            1..1716
                     /locus_tag=
                     /note=
     CDS             1..1716
                     /locus_tag=

SnapGene was not generating the genes because I have not annotated them... But I am also worried about what to do with the genomes I have that are not public yet. I have to generate the files somehow, and the error will probably still be there. I don't know what else to do. I'm just trying to figure it out by comparing the gbk files I generate with those from NCBI.

hungenlai90 commented 3 years ago

In my gbk files all of them have no gene annotation (just CDS, misc_feature and primer_bind). They ran fine without error, so not sure if that is the cause of this issue?

gamcil commented 3 years ago

Hey everyone,

I think this is because clinker currently only checks for protein_id, locus_tag and ID qualifiers to use as gene names. In @aberaslop's file for example, the features have these instead:

/Name="input.path1.gene38"
/gene="input.path1.gene38"

The quick fix would then just be to do a search and replace on the problematic files (i.e. change Name= to protein_id=). When I get some time I'll add some extra qualifiers for it so you shouldn't have this problem.
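A minimal sketch of that search and replace in Python, in case a text editor is not convenient for many files. The function names `rename_qualifier` and `fix_file` are made up for illustration and are not part of clinker. It assumes qualifiers sit on their own lines in the usual GenBank layout, and that the affected features do not already carry a protein_id (otherwise you would end up with two).

```python
import re
from pathlib import Path


def rename_qualifier(text, old="Name", new="protein_id"):
    """Rename a GenBank qualifier key, e.g. /Name="x" -> /protein_id="x".

    Only lines that begin (after indentation) with /<old>= are touched,
    so the same word inside a qualifier value is left alone.
    """
    pattern = re.compile(rf"^([ \t]*)/{re.escape(old)}=", flags=re.MULTILINE)
    return pattern.sub(lambda m: f"{m.group(1)}/{new}=", text)


def fix_file(src, dst):
    """Write a renamed copy of src to dst (hypothetical helper, keeps the original)."""
    src, dst = Path(src), Path(dst)
    dst.write_text(rename_qualifier(src.read_text()))
```

Writing to a new file rather than overwriting keeps the original around in case the replacement touches something unexpected.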

As far as features go, clinker only looks for CDS, so you shouldn't need any gene/mRNA etc.
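If it is not obvious which CDS features are the culprits, a rough scan like the one below can list them. It is only a sketch: `cds_without_identifier` is a hypothetical helper, not clinker code, the accepted names are taken from the comment above (protein_id, locus_tag, ID), and the parsing is a plain-text heuristic for the standard GenBank feature table rather than a full parser.

```python
import re

# Qualifier names mentioned in this thread as the ones clinker checked at the time.
ACCEPTED = ("protein_id", "locus_tag", "ID")

# Feature key lines: 5 spaces, the key (e.g. CDS), spaces, then the location.
FEATURE_RE = re.compile(r"^ {5}(\S+) +(\S+)")
# Qualifier lines: indentation, then /name (heuristic; ignores wrapped values).
QUALIFIER_RE = re.compile(r"^\s+/([^=\s]+)")


def cds_without_identifier(text, accepted=ACCEPTED):
    """Return (location, qualifier names) for CDS features lacking any accepted name."""
    missing = []
    current = None  # (location, set of qualifier names) of the CDS being read
    in_features = False

    def flush():
        if current is not None and not current[1].intersection(accepted):
            missing.append((current[0], sorted(current[1])))

    for line in text.splitlines():
        if line and not line[0].isspace():
            # A new top-level section (FEATURES, ORIGIN, //, ...) ends any open feature.
            flush()
            current = None
            in_features = line.startswith("FEATURES")
            continue
        if not in_features:
            continue
        qualifier = QUALIFIER_RE.match(line)
        if qualifier:
            if current is not None:
                current[1].add(qualifier.group(1))
            continue
        feature = FEATURE_RE.match(line)
        if feature:
            flush()
            key, location = feature.groups()
            current = (location, set()) if key == "CDS" else None
    flush()
    return missing
```

Running it over the text of each gbk file and printing the result should show the location of every CDS that has none of the accepted qualifiers, along with the qualifier names it does have.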

aberaslop commented 3 years ago

Hi Cameron,

Thank you for such a quick answer and for solving my issue! Changing /Name to /protein_id totally fixed the problem.

Thank you so much!!

L.

sjmoore505 commented 3 years ago

This is a cool tool and straightforward to use - thanks to @hungenlai90 for heads up

gamcil commented 3 years ago

This should now be fixed in 0.0.7. clinker should now save most common name qualifiers, though if it's missing some/erroring feel free to re-open this issue.

gamcil / clinker

Valid CDS identifier? #4