OpenTechStrategies / torque-sites

Open source code specific to OTS-managed Torque sites (usually client sites).

Passing structured formatted data #101

Open slifty opened 3 years ago

slifty commented 3 years ago

The Use Case

We rely on CSVs for passing data into Torque. This makes plenty of sense, but it also limits the amount of (basic) formatting that can be assigned to that data. For instance, a diligence report might involve bulleted lists, or bolded / emphasized text within a given sentence.

There is an additional issue that some of the diligence / followup documents have lots of different sections, and asking folks to paste text into CSVs feels a bit clunky (since spreadsheets aren't really intended for long-form, multi-paragraph text blocks).

The non-engineered solution to this involved taking PDF / Word documents, using ghostscript and pandoc to extract text, and manually inserting the data into the system. This was a fraught process (but resulted in a deeper understanding of the data which is always nice).

An Engineered Solution

What I'm working on now is just a first draft at an engineered solution. We'll iterate over time I'm sure.

  1. I've created very simple Word templates which use Word's heading styles to demarcate various sections and subsections.

  2. PDF and other long-form data is inserted into these Word "templates" manually and provided to OTS.

  3. As part of ETL, I use pandoc to convert them to plain text (Markdown or MediaWiki markup).

  4. Remove any random HTML that got inserted (sometimes indentation causes trouble)

  5. Do some basic string replacement to convert the document to ArchieML format, so that each section becomes semantically accessible.

  6. Convert that structured object into the CSV Torque expects.

At that point, it's all just Torque data like anything else.
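
Steps 3 and 4 above could be sketched roughly as follows. This is a minimal sketch, not the script actually used: it assumes pandoc is installed and on the PATH, the function names are hypothetical, and since the thread doesn't say which HTML shows up in the output, the tag-stripping regex is just a guess at "remove anything that looks like a tag".

```python
import re
import subprocess


def docx_to_mediawiki(path):
    """Convert a .docx file to MediaWiki markup by shelling out to pandoc.

    Assumes the pandoc executable is available on the PATH.
    """
    result = subprocess.run(
        ["pandoc", "--from", "docx", "--to", "mediawiki", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Matches opening and closing tags such as <blockquote> or </div>.
# Requiring a letter right after "<" leaves plain text like "a < b" alone.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def strip_html(text):
    """Remove stray HTML tags, keeping the text between them."""
    return TAG_RE.sub("", text)
```

Typical use would be something like `strip_html(docx_to_mediawiki("report.docx"))`, where `report.docx` stands in for one of the filled-in templates.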

slifty commented 3 years ago

Here's a sample (fabricated) output of the docx => txt template conversion:

= Fashion Review =
== Overview ==

Overall the sweat pants and t shirt while working from home look is a tried and true outfit and we have no concerns.

== Shirt ==
===Description===
The shirt is definitely in need of ironing.

===Color===
White

===Size===
L

===Type===
It's a t shirt!

== Pants ==

The outfit involved blue sweat pants, which offered comfort and a basic amount of warmth.

frankduncan commented 3 years ago

While I totally oppose this on the grounds that it moves us further and further from the true origins of the project, this is probably quite correct. We should most likely do this once we have the Postgres version of Torque in production, and we should probably make the API less "UPLOAD ALL OF THE THINGS" and more piecemeal. For instance, we could expand the -p option with an option that says "don't upload TOCs, but rather just proposal data", for faster turnaround when new data only touches a handful of proposals.

As an aside, the current CSV upload has a "json" type for some columns, which has been used to upload tabular data (see the financial data adder), so we're basically already there.
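
To illustrate the idea of a JSON-valued column, here is a small sketch of writing structured values into CSV cells as JSON strings. The function name and the column names in the usage note are made up for illustration; how Torque is told that a column has the "json" type is not shown in this thread and is not covered here.

```python
import csv
import io
import json


def rows_to_csv(rows, fieldnames):
    """Serialize rows to CSV text, JSON-encoding any non-string cell value."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(
            {
                # Strings pass through; dicts, lists, numbers become JSON text.
                key: value if isinstance(value, str) else json.dumps(value)
                for key, value in row.items()
            }
        )
    return buffer.getvalue()
```

For example, a row like `{"Application #": "1", "Financials": [{"year": 2020, "revenue": 100}]}` ends up with the whole list in a single quoted cell, which the reading side can recover with `json.loads`.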

slifty commented 3 years ago

One maybe vital note: I'm treating ALL of this as one big series of preprocessing steps -- CSV is the ultimate output that gets piped to torque so hopefully that's a good enough homage 😂

slifty commented 3 years ago

I'm thinking ArchieML is probably too much -- really I just want to go through line by line and whenever there's a new header level (= => ==) the parser should spin up a new key and assign it to some object.

Gonna whip up a quick script for this.
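
A line-by-line parser along those lines might look like the sketch below. It is only one way to do it and is not the script referred to above: the function name is hypothetical, and the choice to key each section by its full header path (joined with " / ") is an assumption made so the result flattens easily into CSV columns.

```python
import re

# A header line is the same run of "=" on both sides of a title,
# e.g. "== Shirt ==" or "===Color===".
HEADER_RE = re.compile(r"^(=+)\s*(.+?)\s*\1\s*$")


def parse_sections(text):
    """Map each header path (e.g. 'Shirt / Color') to the text beneath it."""
    sections = {}
    path = []  # titles of the currently open headers, outermost first
    body = []  # lines collected for the section currently being read

    def flush():
        content = "\n".join(body).strip()
        # Headers with no text of their own (pure containers) are skipped.
        if path and content:
            sections[" / ".join(path)] = content
        body.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close any headers at this level or deeper, then open the new one.
            del path[level - 1:]
            path.append(match.group(2))
        else:
            body.append(line)
    flush()
    return sections
```

Run against the sample output earlier in the thread, this would yield keys such as "Fashion Review / Shirt / Color" (with value "White") and "Fashion Review / Pants", each of which could become one CSV column.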