Open nbuso opened 6 years ago
I am also seeing this type of error (FF.10):
Invalid organism name "Idiomarina sp. Loihi-Chm(16S)-1" - format is expected to be of the format "Genus species (name)" - where (name) is optional line: 95089
The data come from non-coding_release r133, which does not seem to follow the documentation at ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt (3.4.7 The OS Line). I presume whoever submitted the data used round brackets, which should be reserved for identifying the common name.
I also noticed this error message:
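To illustrate why the bracket in the middle of that name trips a "Genus species (name)" check, here is a minimal sketch. The pattern and the class name `OrganismNameSketch` are my own assumptions for illustration and are not the regular expression the validator actually uses; the sketch only allows one parenthesised group, preceded by whitespace, at the very end of the name.

```java
import java.util.regex.Pattern;

class OrganismNameSketch {
    // Hypothetical pattern (an assumption, not the validator's real one):
    // free text without brackets, then an optional " (common name)" at the end.
    static final Pattern ORGANISM =
            Pattern.compile("^([^()]+?)(?:\\s+\\(([^()]+)\\))?$");

    // Returns true when the whole name fits "Genus species (name)" as sketched above.
    static boolean looksValid(String name) {
        return ORGANISM.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] names = {
            "Homo sapiens (human)",
            "Escherichia coli",
            "Idiomarina sp. Loihi-Chm(16S)-1"
        };
        for (String name : names) {
            System.out.println(name + " -> " + looksValid(name));
        }
    }
}
```

Under this assumed pattern the first two names pass, while the r133 name fails because "(16S)" sits inside the strain designation rather than at the end as a common name.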
CON entry must have CO(CONDIV) lines
The error is detected in EmblEntryReader (around line 308) because the AGPValidation check validates the entry and AgptoConFix sets the dataclass (if I am interpreting this correctly). However, section 3.4.14 The CO Line (in CON records only) of ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt does not seem to mandate the presence of CO lines.
Could you please provide us with examples of the non-coding flat files you have been working with which generate the errors described above? I think it is fastest if we just attempt to parse these flat files until we have resolved all encountered problems.
And happy new year!
I prepared a pull request to share the code, and we can provide a report of the 'offending' files whenever you request it. I presume you will periodically receive duplicate information.
Hi, I need to parse non-coding and coding data, and there are currently a few problems that I would like to address. I can create a pull request when done, but first I would like to agree on how to proceed.
1) AC lines are not present in non-coding. This check is done through EmblEntryReader.EXACTLY_ONCE_BLOCKS, which can be changed.
2) Block RL has invalid characters. I need to understand why it is validated against the regexp "&(?:\#(?:([0-9]+)|Xx)|([A-Za-z0-9]+));?". In paragraph 3.4.10.8 of ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt I can't see any mention of allowed characters.
3) Feature qualifier "specimen_voucher" has invalid characters: "H&B7227 (NU)". Here too there is a check against the regexp "&(?:\#(?:([0-9]+)|Xx)|([A-Za-z0-9]+));?", but in ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/FT_current.html#3.3.3 I see 4 types of qualifier values, and none is similar to the regexp I see in the code.
Is it OK from your point of view to change the regular expression to cover such cases, or would you prefer to create special cases for the ncr files?
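For reference, here is a small sketch of what that regexp does with the specimen_voucher value. The pattern is copied from the text above; the class and method names are hypothetical and only wrap `java.util.regex`, they are not part of the validator. My reading is that the pattern looks for anything resembling an HTML-style entity reference, so a bare "&" followed by letters or digits is enough to trigger it.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityReferenceCheck {
    // Pattern quoted in this issue, written as a Java string literal.
    static final Pattern ENTITY_LIKE =
            Pattern.compile("&(?:\\#(?:([0-9]+)|Xx)|([A-Za-z0-9]+));?");

    // Returns the first substring that looks like an entity reference, if any.
    static Optional<String> firstEntityLike(String value) {
        Matcher m = ENTITY_LIKE.matcher(value);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] values = {"H&B7227 (NU)", "HB7227 (NU)", "&#169; 2010"};
        for (String v : values) {
            System.out.println(v + " -> " + firstEntityLike(v).orElse("(no match)"));
        }
    }
}
```

In "H&B7227 (NU)" the pattern finds "&B7227", which is presumably why the value is reported as containing invalid characters even though the ampersand is part of a legitimate voucher identifier. The same value without the ampersand produces no match.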