enasequence / sequencetools

Webin sequence validation API.
Apache License 2.0

Parsing of non-coding flat files #23

Open nbuso opened 6 years ago

nbuso commented 6 years ago

Hi, I need to parse both non-coding and coding data, and there are currently a few problems that I would like to address. I can create a pull request when done, but first I would like to agree on how to proceed.

1) AC lines are not present in non-coding files. This check is done through EmblEntryReader.EXACTLY_ONCE_BLOCKS, which can be changed.

2) Block RL has invalid characters. I need to understand why the value is validated against the regexp "&(?:\#(?:([0-9]+)|Xx)|([A-Za-z0-9]+));?". In this document: ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt paragraph 3.4.10.8, I can't see any mention of allowed characters.

3) Feature qualifier "specimen_voucher" has invalid characters: "H&B7227 (NU)". The same applies here: there is a check against the regexp "&(?:\#(?:([0-9]+)|Xx)|([A-Za-z0-9]+));?", but in this document: ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/FT_current.html#3.3.3 I see 4 types of qualifier values, and none of them is similar to the regexp I see in the code.

Is it OK from your point of view to change the regular expression to cover such situations, or would you prefer to create special cases for the ncr files?
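
To make the regexp question concrete, here is a minimal Python sketch that runs the pattern quoted above against the reported qualifier value. The pattern is copied verbatim from this issue; the helper name `find_entity_like` and the use of a substring search (rather than a full match) are assumptions, since the way the Java code applies the pattern is not shown here. Read this way, the pattern looks less like a list of allowed characters and more like a detector for HTML-entity-like sequences, which would explain why "&B7227" is flagged.

```python
import re

# Pattern quoted in the issue, copied verbatim.
ENTITY_LIKE = re.compile(r"&(?:\#(?:([0-9]+)|Xx)|([A-Za-z0-9]+));?")


def find_entity_like(value):
    """Return every substring of `value` that the pattern matches.

    Hypothetical helper: it assumes the validator searches for the
    pattern anywhere in the value.
    """
    return [m.group(0) for m in ENTITY_LIKE.finditer(value)]


# The ampersand followed by letters/digits is what triggers the match.
print(find_entity_like("H&B7227 (NU)"))   # ['&B7227']
# An ampersand followed by a space is not matched.
print(find_entity_like("Smith & Jones"))  # []
# A numeric character reference is matched in full.
print(find_entity_like("caf&#233;"))      # ['&#233;']
```

If that reading is right, the parentheses and the space in "H&B7227 (NU)" are not the problem; only the "&" directly followed by alphanumerics is.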

nbuso commented 6 years ago

I also have this type of error (FF.10):

Invalid organism name "Idiomarina sp. Loihi-Chm(16S)-1" - format is expected to be of the format "Genus species (name)" - where (name) is optional line: 95089

The data come from non-coding_release r133, which seems not to respect the documentation in: ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt (3.4.7 The OS Line). I presume whoever submitted the data is using round brackets, which should be reserved for identifying the common name.
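
As an illustration of why this name is rejected, the sketch below encodes the "Genus species (name)" rule from the error message as a regular expression. The pattern and the function `split_organism` are hypothetical and are not taken from the sequencetools code; they only assume that parentheses are allowed solely as one trailing, whitespace-separated group holding the common name.

```python
import re

# Hypothetical pattern: a scientific name without parentheses, optionally
# followed by whitespace and a single "(common name)" group at the end.
ORGANISM = re.compile(r"([^()]+?)(?:\s+\(([^()]+)\))?")


def split_organism(os_value):
    """Return (scientific_name, common_name) or None if the value does not fit."""
    m = ORGANISM.fullmatch(os_value)
    if m is None:
        return None
    return m.group(1), m.group(2)


print(split_organism("Homo sapiens (human)"))             # ('Homo sapiens', 'human')
print(split_organism("Escherichia coli"))                 # ('Escherichia coli', None)
# Parentheses in the middle of the name do not fit the assumed rule.
print(split_organism("Idiomarina sp. Loihi-Chm(16S)-1"))  # None
```

Under this assumed rule the reported name fails because "(16S)" sits inside the name instead of at the end, so either the rule or the data would have to change.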

nbuso commented 6 years ago

I also noticed this error message:

CON entry must have CO(CONDIV) lines

The error is detected in EmblEntryReader (around line 308) because the AGPValidation check is validating the entry and AgptoConFix sets the dataclass (if I'm interpreting everything correctly). If I look at: ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt 3.4.14 The CO Line (in CON records only)

it does not seem to mandate the presence of CO lines.
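
To find the entries that trigger this message, a rough Python sketch like the following could be run over a flat file, one entry at a time. It is not the sequencetools logic: the function names are hypothetical, the sample entries are made up, and it assumes the data class is the fifth semicolon-separated token on the ID line.

```python
from collections import Counter


def line_code_counts(entry_text):
    """Count the two-letter line codes (ID, AC, CO, ...) in one entry."""
    counts = Counter()
    for line in entry_text.splitlines():
        code = line[:2]
        # Skip sequence lines (leading spaces) and the "//" terminator.
        if code.strip() and code != "//":
            counts[code] += 1
    return counts


def data_class(entry_text):
    """Return the data class token from the ID line, or None if absent."""
    for line in entry_text.splitlines():
        if line.startswith("ID "):
            fields = [f.strip() for f in line[2:].split(";")]
            # Assumption: the fifth token is the data class.
            return fields[4] if len(fields) > 4 else None
    return None


def con_without_co(entry_text):
    """True when an entry is marked CON but carries no CO lines."""
    return data_class(entry_text) == "CON" and line_code_counts(entry_text)["CO"] == 0


# Made-up entries, for illustration only.
no_co = "ID   XX000001; SV 1; linear; genomic DNA; CON; PRO; 1000 BP.\nXX\n//\n"
with_co = ("ID   XX000001; SV 1; linear; genomic DNA; CON; PRO; 1000 BP.\n"
           "XX\nCO   join(XX000002.1:1..500)\n//\n")

print(con_without_co(no_co))    # True
print(con_without_co(with_co))  # False
```

The same counter could also report entries without an AC line, which relates to the EXACTLY_ONCE_BLOCKS point raised in the first comment.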

raskoleinonen commented 6 years ago

Could you please provide us with examples of the non-coding flat files you have been working with which generate the errors described above? I think it is fastest if we just attempt to parse these flat files until we have resolved all encountered problems.

And happy new year!

nbuso commented 6 years ago

I prepared a pull request to share the code, and we can provide a report of the 'offending' files whenever you request one. I presume you will periodically get duplicate information.