InformaticsGenomicMedicine / DraftCoreDataModel

Draft of CoreDataModel
Apache License 2.0

Sequence (allele) validation should be type-specific #7

Closed (rrfreimuth closed 12 months ago)

rrfreimuth commented 1 year ago

https://github.com/InformaticsGenomicMedicine/DraftCoreDataModel/blob/79450932a44ee677816f9c02e6f5660ea03e5ec1/src/core_variant.py#L145C24-L165

This method needs to be more flexible. Please:

  1. Add a dictionary of regex patterns (all case-insensitive), one for each sequence type:
     A. DNA: ^[ACGT]*$
     B. RNA: ^[ACGU]*$
     C. PROTEIN: use 1-letter IUPAC codes
  2. Select the appropriate pattern based on input seq type
  3. Throw an exception if the input contains a character that isn't in the allowed list

Note that an empty string should be checked separately, before the regex is applied (for efficiency: if the input is an empty string, return immediately). The regex must not include \s, because that would permit spaces in the middle of the sequence.
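
A minimal sketch of the three steps above, not the actual code in core_variant.py. The names SEQUENCE_PATTERNS and validate_sequence are hypothetical, and the protein pattern assumes only the 20 standard one-letter amino acid codes; whether ambiguity codes or a stop symbol should also be allowed is left open here.

```python
import re

# Hypothetical names for illustration; not taken from core_variant.py.
# One case-insensitive pattern per sequence type, as written in this issue.
SEQUENCE_PATTERNS = {
    "DNA": re.compile(r"^[ACGT]*$", re.IGNORECASE),
    "RNA": re.compile(r"^[ACGU]*$", re.IGNORECASE),
    # Assumption: only the 20 standard 1-letter amino acid codes.
    "PROTEIN": re.compile(r"^[ACDEFGHIKLMNPQRSTVWY]*$", re.IGNORECASE),
}


def validate_sequence(sequence: str, seq_type: str) -> str:
    """Return the sequence unchanged if it is valid for seq_type, else raise."""
    # Check the empty string first so no regex work is done for it.
    if sequence == "":
        return sequence

    # Select only the pattern that belongs to the given sequence type.
    pattern = SEQUENCE_PATTERNS.get(seq_type.upper())
    if pattern is None:
        raise ValueError(f"Unsupported sequence type: {seq_type!r}")

    # fullmatch is used instead of match because "$" alone would still
    # accept a single trailing newline (for example "ACGT\n").
    if pattern.fullmatch(sequence) is None:
        raise ValueError(
            f"Invalid character in {seq_type.upper()} sequence: {sequence!r}"
        )
    return sequence
```

With this sketch, validate_sequence("acgt", "DNA") returns the input, while validate_sequence("ACGU", "DNA") and validate_sequence("AC GT", "DNA") both raise ValueError.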

SalemBajjali commented 1 year ago

Further discussion is needed regarding the reference allele and alternative allele. Please refer to issue #8.

I made changes to the code. Originally, a single validation step was used for both the reference allele and the alternative allele. I split it into two separate steps and added the regular expressions described above. I also added a separate regular expression for the reference allele, because the SPDI expression allows digits there.

https://github.com/InformaticsGenomicMedicine/DraftCoreDataModel/blob/66f815799c6c614970c4caed71740a6bd41ce2e6/src/core_variant.py#L152-L211

rrfreimuth commented 1 year ago

The validation routine should use only the appropriate pattern given the type of sequence. For example, if the sequence (allele) is DNA, then the pattern for RNA and protein should not be used. This means the sequence type probably needs to be passed as a param.

Note that the logic on line 179 throws an exception whenever the input does not match the first pattern tested (digits), so the sequence-specific pattern is never reached for a non-digit input. The logic should be: if (not match digits) and (not match seq-specific pattern), then throw an exception.
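
The intended condition can be sketched as follows. This is an illustration under assumptions, not the code at line 179: the function name validate_reference_allele and the pattern names are hypothetical, and DNA is used as the default sequence-specific pattern only for the example.

```python
import re

# Hypothetical names for illustration; not taken from core_variant.py.
DIGITS_PATTERN = re.compile(r"[0-9]+")
DNA_PATTERN = re.compile(r"[ACGT]*", re.IGNORECASE)


def validate_reference_allele(allele: str, seq_pattern: re.Pattern = DNA_PATTERN) -> str:
    """Accept a reference allele that is all digits OR a valid sequence."""
    # Empty input is handled before any regex is applied.
    if allele == "":
        return allele

    is_digits = DIGITS_PATTERN.fullmatch(allele) is not None
    is_sequence = seq_pattern.fullmatch(allele) is not None

    # Raise only when BOTH checks fail; failing the digit check alone
    # must not reject an otherwise valid sequence.
    if not is_digits and not is_sequence:
        raise ValueError(f"Invalid reference allele: {allele!r}")
    return allele
```

Here "3" and "ACGT" are both accepted, while a mixed value such as "A3" fails both checks and raises ValueError.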

SalemBajjali commented 12 months ago

Reviewed with @rrfreimuth.