enasequence / sequencetools

Webin sequence validation API.
Apache License 2.0

23: Introduce EBISearchEmblEntryReader, readers that keep content als… #27

Closed nbuso closed 6 years ago

nbuso commented 6 years ago

…o not validated, permit to skip html entity validation, AC not needed in coding/non-coding

The build.gradle probably needs attention; I presume you know it better than I do.

Let me know if you need me to clarify anything.

nbuso commented 6 years ago

Hi, I tried to keep the code as separate as possible from the main code because I feel your use cases are slightly different from ours; glad you feel it is not modifying the main behaviour of the library. In general, the aim of the modifications is to record the error but keep the value even if it is wrong. In our indexing process we can't do anything about EMBL, coding, or non-coding validation; we can only keep the data as it is, because it is in any case part of a release that is already public.
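
To make the "record the error but keep the value" idea concrete, here is a minimal sketch. The class `TolerantFieldSketch`, the `FieldResult` type and the `read` method are hypothetical names made up for illustration; they are not the sequencetools API, and the real readers may be structured quite differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch, not part of sequencetools. */
public class TolerantFieldSketch {

    /** A field value together with any validation messages raised for it. */
    public static final class FieldResult {
        public final String value;
        public final List<String> errors;

        FieldResult(String value, List<String> errors) {
            this.value = value;
            this.errors = Collections.unmodifiableList(errors);
        }
    }

    /**
     * Runs the given check, records a message when it fails,
     * and always returns the raw value unchanged.
     */
    public static FieldResult read(String fieldName, String raw, Predicate<String> isValid) {
        List<String> errors = new ArrayList<>();
        if (!isValid.test(raw)) {
            errors.add(fieldName + ": value failed validation");
        }
        return new FieldResult(raw, errors);
    }

    public static void main(String[] args) {
        // Assumed rule for the demo only: the value must be upper case.
        FieldResult r = read("XX", "mixedCase", s -> s.equals(s.toUpperCase()));
        System.out.println(r.value + " kept with " + r.errors.size() + " error(s)");
    }
}
```

With this shape, an indexer can store `value` as released and still pass `errors` on to whoever wants a report.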

About your questions:

  1. I presume you do HTML entity validation to ensure a submission does not add wrong characters to the data. Unfortunately, there are cases where descriptions contain the '&' and/or ';' characters, which trigger this validation error, and there are lots of such cases.

  2. OSTollerantReader, RATollerantReader:

    1. There are cases like 'Yersinia enterocolitica (type O:2) str. YE3094/96' that don't validate. The tolerant reader keeps the whole text as the organism name without trying to split it into 'name' and 'common name'.
    2. There are already cases like 'MiguelGL.L' that don't validate. The tolerant version of the reader simply keeps the author as is, without trying to split it into surname and name.
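
The organism case can be sketched as follows. This is only an illustration under assumptions: `OrganismLineSketch`, `Organism` and `readTolerant` are hypothetical names, and the strict rule used here (a name optionally followed by a single trailing "(common name)") is an assumed simplification, not the rule the library actually applies.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch, not the sequencetools readers. */
public class OrganismLineSketch {

    public static final class Organism {
        public final String name;
        public final String commonName; // null when absent
        public final String error;      // null when the line split cleanly

        Organism(String name, String commonName, String error) {
            this.name = name;
            this.commonName = commonName;
            this.error = error;
        }
    }

    // Assumed strict rule: a name without parentheses, optionally followed
    // by one "(common name)" group at the very end of the line.
    private static final Pattern STRICT =
            Pattern.compile("^([^()]+?)(?:\\s*\\(([^()]+)\\))?$");

    /** Splits when the strict rule matches; otherwise keeps the whole text as the name. */
    public static Organism readTolerant(String line) {
        String text = line.trim();
        Matcher m = STRICT.matcher(text);
        if (m.matches()) {
            return new Organism(m.group(1).trim(), m.group(2), null);
        }
        // Tolerant fallback: keep the text as is and report the problem.
        return new Organism(text, null, "could not split name and common name");
    }

    public static void main(String[] args) {
        Organism o = readTolerant("Yersinia enterocolitica (type O:2) str. YE3094/96");
        System.out.println(o.name + " | error: " + o.error);
    }
}
```

The same fallback would apply to an author such as 'MiguelGL.L': keep the original string and attach the error instead of dropping the value.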

Let me know if you want part of the code modified and I can look into it. We can also provide you with error reports containing precise information about the files that are not valid. The only problem is that we index several times between your releases, so it does not make sense for us to send this report every time, because it would contain lots of repetitions. Maybe we can do it at every EMBL public release? Let me know your thoughts.