SuffolkLITLab / docassemble-ALWeaver

A tool to help quickly generate draft interviews from an existing document (PDF or DOCX) for the docassemble platform.
https://apps-test.suffolklitlab.org/start/ALWeaver/assembly_line/#/1
MIT License

PDF 'else' people-finding regex brainstorm #205

Closed plocket closed 3 years ago

plocket commented 3 years ago

[May have found an answer, though it needs thorough testing. Not sure how I did it without a positive lookahead, but...]

Working towards this challenge:

  1. ~(.*?)(?:(_mail_address)|(_address)|(_phone))$. See and test the use of this regex~. [Regex101 failed me. Showed it to me as working when a reload showed it wasn't.]
  2. (.+?)((_mail)|(_phone))?((_mail)+|(_address)+|(_phone)+)$ https://regex101.com/r/NlvtQa/8/ [Unfortunately, this means splitting up our words more, and I'm not sure how scalable that is, though it might be. It also matches 'z_mail_mail_address'. Is that correct or not?]
  3. Possibly promising if we don't want those middle examples: (.+?)(?:(?:_mail_address)|(?:_address)|(?:_phone))+$ https://regex101.com/r/NlvtQa/10 if the behavior with those middle lines is as desired.
  4. Most promising capturing only the prefix: (.+?)(?:(?:_mail_address$)|(?:_mail_address_address$)|(?:_address$)|(?:_address_address$)|(?:_phone$)) https://regex101.com/r/NlvtQa/14/
  5. Most promising capturing the prefix and the suffix: (.+?)((_mail_address$)|(_mail_address_address$)|(_address$)|(_address_address$)|(_phone$)) https://regex101.com/r/NlvtQa/13
  6. There may not be a good regex solution, though we may be able to work on clarifying the code a bit.
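
To see how candidates 3 and 4 above differ in practice, here is a minimal Python sketch. It assumes the patterns are applied with `re.match` from the start of each field name; the field names and the `get_prefix` helper are hypothetical, chosen only for illustration, and are not taken from the ALWeaver code.

```python
import re

# Candidate 4 from the list above: capture only the prefix.
PREFIX_ONLY = re.compile(
    r"(.+?)(?:(?:_mail_address$)|(?:_mail_address_address$)"
    r"|(?:_address$)|(?:_address_address$)|(?:_phone$))"
)

# Candidate 3 from the list above: allow any run of known suffixes.
REPEATED_SUFFIX = re.compile(
    r"(.+?)(?:(?:_mail_address)|(?:_address)|(?:_phone))+$"
)


def get_prefix(pattern, field_name):
    """Return the captured prefix, or None when no suffix matches."""
    match = pattern.match(field_name)
    return match.group(1) if match else None


# Hypothetical field names, used only to compare the two candidates.
sample_names = [
    "user_mail_address",
    "user_address_address",
    "user_phone",
    "user_name",           # no known suffix, so neither pattern matches
    "user_address_phone",  # the two candidates disagree on this one
]

for name in sample_names:
    print(name, get_prefix(PREFIX_ONLY, name), get_prefix(REPEATED_SUFFIX, name))
```

On `user_address_phone`, candidate 4 treats `user_address` as the prefix because only the listed suffix combinations are allowed, while candidate 3 consumes both suffixes and leaves `user`. Which behavior is wanted depends on the field names the PDFs actually contain.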

plocket commented 3 years ago

May have found an answer to this in 4 or 5:

  1. Most promising capturing only the prefix: (.+?)(?:(?:_mail_address$)|(?:_mail_address_address$)|(?:_address$)|(?:_address_address$)|(?:_phone$)) https://regex101.com/r/NlvtQa/14/
  2. Most promising capturing the prefix and the suffix: (.+?)((_mail_address$)|(_mail_address_address$)|(_address$)|(_address_address$)|(_phone$)) https://regex101.com/r/NlvtQa/13
  3. [Add a ?: (see regex101) at the start of the outer second group, the one wrapping all the suffix alternatives, to avoid the extraneous captured group.]
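
The effect of that `?:` can be sketched in Python as follows. This is an illustration under assumptions, not the code that ended up in ALWeaver: the `SUFFIXES` constant, the two compiled names, and the `user_phone` field name are hypothetical.

```python
import re

# The suffix alternatives shared by the two patterns in this comment.
SUFFIXES = (
    r"(_mail_address$)|(_mail_address_address$)"
    r"|(_address$)|(_address_address$)|(_phone$)"
)

# Outer group captures: the matched suffix appears twice in the result.
WITH_OUTER_GROUP = re.compile(r"(.+?)(" + SUFFIXES + r")")

# Outer group made non-capturing with ?: so only the prefix and the
# individual suffix groups are reported.
WITHOUT_OUTER_GROUP = re.compile(r"(.+?)(?:" + SUFFIXES + r")")

name = "user_phone"  # hypothetical field name
print(WITH_OUTER_GROUP.match(name).groups())
# ('user', '_phone', None, None, None, None, '_phone')
print(WITHOUT_OUTER_GROUP.match(name).groups())
# ('user', None, None, None, None, '_phone')
```

Groups that did not take part in the match come back as `None`, so the suffix that was found can be read off as the one non-`None` entry after the prefix.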

plocket commented 3 years ago

Actually did end up needing to use this, so glad it's sorted! Used 6 to capture each group.