GBK file error - Githubissues

amcomeau commented 10 months ago

Hello and thanks for the potentially interesting software. I'm doing a first test run on my smallest contig file from a metagenome assembly (meta-spades) and, after getting through some initial setup problems (table2asn separate install, Bio::SearchIO::hmmer error, etc.), Metascan appears to now be running and seemed to get near the end, but it ran into the following problem:

[02:11:09] BLASTing remaining proteins
[02:11:10] Labelling remaining 629 proteins as 'hypothetical protein'
[02:11:10] Adding /locus_tag identifiers
[02:11:11] Assigned 137834 locus_tags to CDS and RNA features.
[02:11:11] Writing outputs to ./KHNCIEKH/
[02:11:23] BLAST-ing 16/5/28S
[02:11:23] Will use blastn to search against /home/andre/bin/NCBIrrnadb/16S_ribosomal_RNA with 8 CPUs
[02:11:23] Deleting unwanted file: ./KHNCIEKH/16S_ribosomal_RNA.blastn
[02:11:23] Deleting unwanted file: temp.txt
[02:11:23] Generating annotation statistics file
[02:11:24] Generating overview files

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Could not read file './KHNCIEKH/KHNCIEKH.gbk': No such file or directory
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:449
STACK: Bio::Root::IO::_initialize_io /usr/share/perl5/Bio/Root/IO.pm:272
STACK: Bio::SeqIO::_initialize /usr/share/perl5/Bio/SeqIO.pm:508
STACK: Bio::SeqIO::genbank::_initialize /usr/share/perl5/Bio/SeqIO/genbank.pm:235
STACK: Bio::SeqIO::new /usr/share/perl5/Bio/SeqIO.pm:384
STACK: Bio::SeqIO::new /usr/share/perl5/Bio/SeqIO.pm:430
STACK: Bio::SeqIO::newFh /usr/share/perl5/Bio/SeqIO.pm:452
STACK: /home/andre/bin/metascan/metascan.pl:1994
-----------------------------------------------------------

Any ideas? My MGS samples have already been run through our custom MicrobiomeHelper MGS pipeline (which uses Kraken2 + MMseqs2), so I may be forced to just work directly from your original lists of KO numbers that you use to define your different cycles. I could then just grep them out of the already assigned RPKM tables I have, but I wanted to see if your "post-assembly automated way" would work first. Thanks!

ke8ti commented 10 months ago

Hi. I'm getting the same error with a missing .gbk file. Any ideas how to solve this? Thanks!

gcremers commented 10 months ago

Off the top of my head, this seems to be caused by BioPerl trying to convert the gbk file to an embl file when no gbk file was ever written. The missing gbk file is most likely caused by table2asn failing, which can happen for a number of reasons.

The table2asn problem is causing a lot of difficulties upstream, so I am now creating a conda environment that should solve it. I am currently testing the environment and hope to be able to put it up soon.

gcremers commented 9 months ago

The environment is ready. That should solve the problem.
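
For anyone hitting the same exception before switching to the conda environment, the two conditions described above (table2asn not available, no gbk file written) can be checked directly. The following is a minimal sketch, not part of Metascan: the function name `diagnose_missing_gbk` is hypothetical, and it assumes the output file is named `<prefix>.gbk` inside the output directory, as in the path from the error message.

```python
import shutil
from pathlib import Path


def diagnose_missing_gbk(outdir, prefix=None):
    """Return a list of likely reasons why the .gbk file cannot be read.

    Hypothetical helper: assumes the GenBank file is <outdir>/<prefix>.gbk,
    with the prefix defaulting to the output directory name
    (as in './KHNCIEKH/KHNCIEKH.gbk' from the log).
    """
    outdir = Path(outdir)
    prefix = prefix or outdir.name
    problems = []

    # table2asn is the step that writes the .gbk file, so it must be on PATH.
    if shutil.which("table2asn") is None:
        problems.append("table2asn is not on PATH")

    gbk = outdir / f"{prefix}.gbk"
    if not gbk.is_file():
        problems.append(f"{gbk} was not created")
    elif gbk.stat().st_size == 0:
        problems.append(f"{gbk} is empty")

    return problems
```

Calling `diagnose_missing_gbk("./KHNCIEKH/")` after a failed run returns an empty list only when table2asn is found and a non-empty gbk file exists.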

amcomeau commented 2 months ago

Great - I have the newest YAML and things seem to be running, but now I'm encountering a different error/problem. I've gone back to trying to use Metascan on an unassembled set of NextSeq MGS data (wanting to find all norX+nxrAB within a WWTP MGS set) and so I'm using the following command:

metascan input_fasta/ . --cpus 40 --mincontiglen 100 --centre IMR --compliant --force

...however, the program spits out a whole bunch of lines saying it is skipping contigs <200 bp:

[15:42:34] Skipping short (<200 bp) contig: VH01194:87:AACHY3GHV:1:1101:55648:54426:N:0:AGGACAGGCC+AGCTGGAATG#0/1
[15:42:34] Skipping short (<200 bp) contig: VH01194:87:AACHY3GHV:1:1101:19481:54426:N:0:AGGACAGGCC+AGCTGGAATG#0/1
[15:42:34] Skipping short (<200 bp) contig: VH01194:87:AACHY3GHV:1:1101:52391:54426:N:0:AGGACAGGCC+AGCTGGAATG#0/1
[15:42:34] Skipping short (<200 bp) contig: VH01194:87:AACHY3GHV:1:1101:50838:54426:N:0:AGGACAGGCC+AGCTGGAATG#0/1
[15:42:34] Skipping short (<200 bp) contig: VH01194:87:AACHY3GHV:1:1101:51028:54426:N:0:AGGACAGGCC+AGCTGGAATG#0/1
[15:42:34] Skipping short (<200 bp) contig: VH01194:87:AACHY3GHV:1:1101:59321:54426:N:0:AGGACAGGCC+AGCTGGAATG#0/1
[15:42:34] Skipping short (<200 bp) contig: VH01194:87:AACHY3GHV:1:1101:59283:54426:N:0:AGGACAGGCC+AGCTGGAATG#0/1

...even though I put the min length of 100 there to make sure it processes all the 150+ bp reads of Illumina. Any ideas?

amcomeau commented 2 months ago

As a follow-up, I tested and the same thing happens even if I use --mincontiglen 1.

gcremers commented 1 month ago

Could you try without using the --compliant flag? It overrides the --mincontiglen, setting it back to 200bp.
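
The override can be pictured with a small sketch. This is an illustration of the behaviour described in this comment, not Metascan's actual code (Metascan is a Perl script); the function name and the constant are hypothetical.

```python
# Assumption taken from the comment above: --compliant resets the minimum
# contig length to 200 bp regardless of what --mincontiglen was given.
COMPLIANT_MIN_CONTIG_LEN = 200


def effective_min_contig_len(mincontiglen, compliant):
    """Return the minimum contig length that is actually applied."""
    if compliant:
        # --compliant wins over any user-supplied --mincontiglen.
        return COMPLIANT_MIN_CONTIG_LEN
    return mincontiglen
```

Under this model, `--mincontiglen 100` and `--mincontiglen 1` both end up as 200 when `--compliant` is set, which matches the "Skipping short (<200 bp) contig" lines in the log.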

amcomeau commented 1 month ago

Thanks - I missed that labelling in the help where it said it changes the other flags when you select --compliant! Seems to be working OK right now - it is processing.

Not sure why I turned the compliant flag on - this morning I thought it was because I read something in the forum posts here about needing it to bypass some problem, but I don't see it now.

PS: It would be nice to have instructions on how to run Metascan on the initial README page of the GitHub site, with examples, explanations of the flags, etc. There is a post in the Issues that people can eventually find, but people expect to see the instructions on the first page of a GitHub site.

amcomeau commented 1 month ago

Ah, now I know why I turned on the compliant flag...I get an error when running:

[12:17:00] Contig ID must <= 37 chars long: VH01194:87:AACHY3GHV:1:1101:18761:1151:N:0:AGGACAGGCC+AGCTGGAATG#0/1
[12:17:24] Please rename your contigs OR try '--centre X --compliant' to generate clean contig names for fasta file: MLH_A_08092023_S120_L001.assembled_kneaddata.fastq.fasta

...presumably since these are NextSeq reads and not assembled contigs, but Metascan is supposed to be able to handle scanning a collection of MGS reads, right?

gcremers commented 1 month ago

Metascan was not written with reads in mind. The input is meant to be assembled contigs, but I am curious to see if it handles reads as well.

The reason it throws that error is that the read names are too long. Renaming them to something of 37 characters or fewer should also fix it.

amcomeau commented 1 month ago

OK, I did the rename (with seqtk for those interested) and seems to be running fine now - I'll come back to you here once I get some results. As a side note, I'm trying this out since my first attempt of using GenSeed-HMM gave lacklustre results when I used my own HMMs of nxrA+B and norX, so I'm hoping Metascan pulls out more hits.
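
For readers without seqtk, the rename can also be sketched in a few lines of Python. This is an illustrative alternative, not the command used above: the function name `shorten_fasta_ids` and the `read_N` naming scheme are assumptions, and the 37-character limit comes from the error message earlier in the thread. The function streams line by line, so it does not load the whole FASTA file into memory.

```python
def shorten_fasta_ids(src, dst, mapping, prefix="read", max_len=37):
    """Rewrite FASTA headers as short sequential IDs.

    src, dst and mapping are open text file handles. Each header is replaced
    by '<prefix>_<n>', and the pair 'new_id<TAB>old_header' is written to
    mapping so the original read names can be recovered later.
    Returns the number of sequences renamed.
    """
    count = 0
    for line in src:
        if line.startswith(">"):
            count += 1
            new_id = f"{prefix}_{count}"
            if len(new_id) > max_len:
                raise ValueError(f"{new_id} is longer than {max_len} characters")
            old_header = line[1:].strip()
            mapping.write(f"{new_id}\t{old_header}\n")
            dst.write(f">{new_id}\n")
        else:
            # Sequence lines are copied through unchanged.
            dst.write(line)
    return count
```

A typical call opens three files, for example `shorten_fasta_ids(open("in.fasta"), open("out.fasta", "w"), open("names.tsv", "w"))`, where the file names are placeholders.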

amcomeau commented 1 month ago

Metascan got killed on our server due to memory overload (we have 256 GB) on a 15 GB FASTA file...what is the command to see the options/flags for Metascan? Using the typical -h or --help does not work (nor does just typing metascan). This should really be on the first page of your documentation. Does the script have a flag for limiting memory usage?

gcremers commented 1 month ago

In order to get the help info, you need to put an input directory between the metascan command and the help flag.

I couldn't manage to get it working otherwise when I started writing Metascan, but I recently figured out how to solve this, so this quirk should be fixed in the next version. I'm currently fixing some other issues that popped up; after that I can update Metascan. The documentation will get the additional content at the same time.

As for limiting memory, there is no such option, I'm afraid. At the moment I also wouldn't know how to implement this.

amcomeau commented 1 month ago

OK - there are ways to limit memory that are native to Linux (such as ulimit), so I might play with those. However, in the meantime I decided to assemble the MGS with metaSPAdes and then run on the contigs - this file is much smaller and so is running right now.
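
The same kind of limit that ulimit sets can be applied from a small Python wrapper. This is a minimal sketch for Linux, not a Metascan feature: `run_with_memory_limit` is a hypothetical helper that caps the address space (RLIMIT_AS) of the child process, so an over-allocating job fails with an out-of-memory error of its own instead of being killed after exhausting the server.

```python
import resource
import subprocess


def run_with_memory_limit(cmd, max_bytes):
    """Run cmd with its virtual address space capped at max_bytes.

    Returns the exit code of the command. Only the soft limit is lowered,
    and never above the hard limit that is already in place.
    """
    def apply_limit():
        # Runs in the child process just before the command is executed.
        soft, hard = resource.getrlimit(resource.RLIMIT_AS)
        if hard == resource.RLIM_INFINITY:
            limit = max_bytes
        else:
            limit = min(max_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (limit, hard))

    return subprocess.run(cmd, preexec_fn=apply_limit).returncode


# Example (hypothetical paths and limit): cap a run at 200 GiB.
# run_with_memory_limit(["metascan", "input_fasta/", "."], 200 * 1024**3)
```

Note that RLIMIT_AS limits virtual address space per process rather than resident memory, so the cap usually needs some headroom above the expected real usage, and child processes started by the command inherit the same per-process limit.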

gcremers / metascan

GBK file error #4