enasequence / sequencetools

Webin sequence validation API.
Apache License 2.0

Genbank parser #34

Closed jjkoehorst closed 6 years ago

jjkoehorst commented 6 years ago

I have used the GenBank parser in the past, but something seems to have changed in later versions: it no longer finds the features and sequences in a GenBank file.

2018-06-08 13:21:05 DEBUG Logger:144 - Genbank
2018-06-08 13:21:06 DEBUG Logger:224 - Reading new entry
2018-06-08 13:21:06 WARN  Logger:259 - Unknown line type: ID   NZ_BBIY01000170; SV 1; linear; genomic DNA; CON; XXX; 503 BP.
2018-06-08 13:21:06 WARN  Logger:259 - Block uk.ac.ebi.embl.flatfile.validation.FlatFileOrigin@74f5ce22 must occur exactly once
2018-06-08 13:21:06 WARN  Logger:259 - Block uk.ac.ebi.embl.flatfile.validation.FlatFileOrigin@25aca718 must occur at least once
2018-06-08 13:21:06 WARN  Logger:259 - Block uk.ac.ebi.embl.flatfile.validation.FlatFileOrigin@16fdec90 must occur exactly once
2018-06-08 13:21:06 WARN  Logger:259 - Block uk.ac.ebi.embl.flatfile.validation.FlatFileOrigin@1afdd473 must occur exactly once
2018-06-08 13:21:06 DEBUG Logger:229 - Features size: 0
2018-06-08 13:21:06 WARN  Logger:358 - No primary accession available using: 3e9a7173558def2e0d39715d84c28a91

I create the reader with new GenbankEntryReader(reader) and cast it to EntryReader so that it can be used generically in the rest of the code, but the resulting entry is almost empty. Any ideas what might be the cause?

I am using:

compile group: 'uk.ac.ebi.ena.sequence', name: 'embl-api-ff', version: '1.1.210'

jjkoehorst commented 6 years ago

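Found the cause on our side: to speed things up we split the input files into smaller chunks, without realising that the chunks are then saved as EMBL files. The GenBank reader is therefore handed EMBL-format data, which is why it reports the ID line as an unknown line type and finds no features.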
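
For anyone who runs into the same thing, a small guard like the one below might catch the mismatch before a reader is chosen. It is only a sketch: FlatFileSniffer and its Format enum are hypothetical names and are not part of sequencetools. It relies solely on the fact that GenBank records start with a LOCUS line and EMBL records start with an ID line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

/** Hypothetical helper (not part of sequencetools): guesses the flat file format. */
final class FlatFileSniffer {

    enum Format { GENBANK, EMBL, UNKNOWN }

    /**
     * Looks at the first non-blank line only.
     * GenBank records begin with "LOCUS", EMBL records begin with "ID".
     * The reader is consumed, so reopen the file before parsing it.
     */
    static Format detect(Reader in) throws IOException {
        BufferedReader br = new BufferedReader(in);
        String line;
        while ((line = br.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue; // skip leading blank lines
            }
            if (line.startsWith("LOCUS")) {
                return Format.GENBANK;
            }
            if (line.startsWith("ID ")) {
                return Format.EMBL;
            }
            return Format.UNKNOWN; // first real line matches neither format
        }
        return Format.UNKNOWN; // empty input
    }
}
```

With the chunk from the log above, detect(...) would return EMBL, so the calling code could pick an EMBL reader or fail early instead of constructing a GenbankEntryReader on the wrong input.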