IAU-ADES / ADES-Master

ADES implementation based on a master XML file
26 stars 7 forks source link

Invalid trksub in xmltompc80col #50

Closed federicaspoto closed 3 months ago

federicaspoto commented 5 months ago

@stevechesley, we noticed that the script that converts XML to 80-column format was sometimes stopping and reporting a 'invalid trksub' comment.

We have found two reasons: I fixed one, but I am not sure about what we want to do about the second one. In both cases, the XML would pass the ADES validation.

  1. One example of invalid trksub was gb00680. This is actually a valid trksub. I noticed that in packUtil.py there was this regex r'|[A-HL-OQ-SU-Z][A-Za-z0-9]{0,6}' + # anything seven not starting with I-K,P or T, which is not completely correct because it is not allowing any lower case letters at the beginning. The correct one should be r'|([A-HL-OQ-SU-Z]|[a-z])[A-Za-z0-9]{0,6}' + # anything seven not starting with I-K,P or T. I already fixed it in a branch.
  2. A second example of invalid trksub happens when people use submit provIDs under trksub label, e.g. <trkSub>K23X07X</trkSub>. The problem is that this format would pass validation, but then the conversion stops because of the previous regex.

This is not a problem for our ingestion system, even though not completely correct. Also, note that these examples have been submitted in ADES 2017, which means that they would have not been checked anyway.

What would you like to do? Do you want me to change the regex in the packUtil script so that it will accept this trksubs, or would you prefer to reject these observations at the validation level?

Thanks! Federica

(cc @matthewjohnpayne)

stevechesley commented 5 months ago

@federicaspoto After reading the code in packUtil.py and this issue, I'm a bit confused as to what should be done. In particular, I'm wondering if you can specify exactly what should be allowed for trkSub, both in ADES and in obs80 (if different). The code seems to indicate that, among other things, a packed provisional designation is not allowed for trkSub. This may be documented somewhere, but I'm not sure where.

Looking at the documentation, the defining standard requires that trkSub be "up to 8 alphanumeric characters" (A-Za-z0-9_) But adesmaster.xml (and the associated auto doc) allows also a hyphen -. And for non-submission it allows several other characters (- ?+@.()/\\), presumably to cover past cases already submitted.

For xmltompc80col, I think that provID and permID should be written in columns 1-12, and in that case trkSub, if present should be neglected in obs80. If neither permID nor provID are present then trkSub would go in columns 5-12, and there would no (or few) restrictions on what it could be, beyond the "up to 8 alphanumeric characters" specified in the documentation. If there are additional restrictions, e.g., cannot start with a numeral or cannot be a packed provID, then we should document that and enforce it.

In any case, I'm not sure that it makes sense to enforce these restrictions in xmltompc80col. It seems that nothing in a (converted-from-XML) obs80 file will be broken if an arbitrary alphanumeric trkSub appears in columns 5-12.

federicaspoto commented 5 months ago

@stevechesley a couple of thoughts:

So in conclusion, if you agree, I can add the same checks we have in the XSD file to xmltompc80col, this should make everything works consistently. I'll find a place to document that people should not use provID in the trkSub field. I also have to say that this doesn't happen very often.

Thanks! Federica

stevechesley commented 5 months ago

@federicaspoto I agree with your plan.

Not sure if there is a reason to say that provID values cannot be in trkSub.

federicaspoto commented 5 months ago

@stevechesley sounds good! I'll open a PR when this is done.

We can always suggest to use the right fields for the submissions, I just need to find where.

federicaspoto commented 3 months ago

@stevechesley this has been resolved: see https://github.com/IAU-ADES/ADES-Master/pull/58. I'll close the Issue.