Extend extraction highlighter to record extractions in database

robintw commented 4 years ago

Currently the extraction highlighter code (the record methods on Line and Token) record the usages internally to produce a highlighted HTML output. We should extend this so that they can also record extractions in the database.

IanMayo commented 3 years ago

Does this ring a bell @robintw ?

robintw commented 3 years ago

Yes, I remember it - but I must say I don't necessarily think it is needed.

The aim of this was to record every single token that has had data extracted from it in the database - but this would be a huge volume of data to record for every file (especially the very large ones) - and would also require some fairly significant changes to the code that deals with token extraction and highlighting.

I'm not convinced it would be worth the extra time taken on each run (which could be significant, as there could be 100,000s of rows of data to add), the storage space, or the time to implement - but you may have different views. I think the pros/cons would definitely need discussing with the clients before starting to implement.

IanMayo commented 3 years ago

Yes - good points. One power user (E.B) has a vision of being able to reverse-engineer emergent data issues back to their source.

So, someone finds an issue with some data, and they are able to track back to the line of text it came from. This may be more relevant in the narrative-text based processing, and wanting to see the source data in order to verify if there was a typo, or the data got mangled during import.

This feature could be optional for different file types. We should include it as a constructor option, and in the exploitation template.

IanMayo commented 3 years ago

CF:

really valuable for human-generated content
value from recording Pepys build version in log of import

robintw commented 3 years ago

Do you want me to start working on this now @IanMayo? I see it's in the 'Ready to Go' section.

It'll be a reasonably-sized piece of work, so I won't be able to complete it before I go on holiday next Wed.

We'll also need to think about how to structure it: we want to have a way of turning it on and off for different file types, or even individual imports. We'll also need to work out how all the various bits will interact: we'll need to do the highlighting as we parse the file, but then store all the highlighted info to be able to export into the database once the file has been successfully imported (and not import it if there were errors that prevented the file being successfully imported).

Personally, I'm not 100% sure this should be a very high priority at the moment - but I'm definitely the right person to work on it, as I'm the most experienced with the highlighting/tokenising code.

robintw commented 3 years ago

Ah, just realised this was in the section for 'backlog for next sprint' not 'target for this sprint' - I'll chat to you on Slack about what to focus on next.

robintw commented 3 years ago

This looks like it'll be a bit easier than expected - as we already have some of the infrastructure needed. The key question now is what we do about the table schema for storing this in the database. Currently the schema is defined as:

class Extraction(BaseSpatiaLite):
    __tablename__ = constants.EXTRACTION
    table_type = TableTypes.METADATA
    table_type_id = 10

    extraction_id = Column(UUIDType, primary_key=True, default=uuid4)
    table = Column(String(150), nullable=False)
    field = Column(String(150), nullable=False)
    chars = Column(String(150), nullable=False)
    created_date = Column(DateTime, default=datetime.utcnow)

If we image we're recording that we've extracted a timestamp from a CSV file which is giving us State information. Presumably, the fields would be set as follows:

table: States
field: time
chars: "2020-01-05 10:30:23"

However, this doesn't record which entry in the States table this data was used for (ie. the state_id of the State this was used to create) and doesn't record where in the file the characters came from. Do you think we should extend the API to cover this?

We could do something like:

table: States
entry_id: abcd1234 (This is an awful name, but I don't know what to call it)
field: time
text: "2020-01-05 10:30:23"
text_location: ? (see below)

However, I'm not sure how to deal with the location. Most of the time it will be a single range of characters (eg. from char 67 to char 89), but when we combine tokens we can end up with separate character ranges - eg. 67-89 and 101-118 - and of course, we could have any number of these ranges. I'm not sure how best we could store these in the database. We could do some sort of complicated relational design but I think that is overkill. I'm leaning towards just storing it as a string of something like 67-89, 101-118, 205-209 - that would be fairly easily parseable automatically if we needed to (just split on , and then on -). What do you think?

robintw commented 3 years ago

Ah, hang on - we also need to link to the Datafile entry so we know what datafile this came from. So we should add a datafile_id column too.

IanMayo commented 3 years ago

Yes - the text summary of the ranges should be satisfactory.

I think there is merit in recording what the text was interpreted as. It is possible for software to combine the current value of the field with the items listed above. But, it would be easier to inspect if we put the parsed value in here too.

robintw commented 3 years ago

Ok, to make it work across different types, it would have to be the string value of what was parsed - but that should still show us that parsing and unit assigning went ok (eg. it would say 34.5 degree etc)

IanMayo commented 3 years ago

Yes - I agree.

robintw commented 3 years ago

Hmm, now I'm starting to implement this, I'm running into some issues.

When recording the extraction to the database, we need to know the Datafile that the extraction came from, and the entry_id and table that the data is going into. We currently don't have access to the Datafile in the highlighting code, but that is fairly easy to change. What is more difficult is that we won't know the entry_id at the time that we do the recording of the token.

This is for two reasons:

The order of doing things means that we won't necessarily have created the State object by the time we extract the data. For example, the REP importer uses an entirely separate class to do the parsing and recording of tokens, and just returns those values to the REPImporter class - so at the time that we do the recording of tokens, we haven't got any information about the State/Comment/etc entry. Most other importers don't use that approach, but even then we parse and record the timestamp before we create the State entry, as we require a timestamp to call the create_state method.
Even once we create the State/Comment/etc entry, the state_id field (the primary key) doesn't get populated until we save it to the database. We don't do this until a lot later on, after we've run all importers for the file, run the validation and decided whether we're going to import the measurements etc.

I've been wondering whether we could try and link the State objects back to the Extractions at a later stage - so we have a big list of extractions, and then link those to the States once the States have been pushed to the database. That might work, but would be pretty complex and would require quite a lot of fairly major changes.

What do you think we should do @IanMayo? I think there's little point in recording the extractions if we can't link them to specific State objects, but I think it will be very difficult to do that.

IanMayo commented 3 years ago

Oh dear - you're right.

Since every measurement has a timestamp we could store the timestamp. Maybe datafile and timestamp could do? Maybe not.

This isn't python/pythonic, but here is one solution.

When we parse a measurement, we provide it with a callback function. This function takes a measurement id parameter. The callback has a copy of the extraction it represents.

I presume we load data into an intermediate format. For each item in this listing, we give it the callback function. When we generate the state, we then take the state-id, and call the callback function with it.

Aah, I guess we could add the callback to the create-state function. If it's non-null, then it's called once the state-id is known.

Just a couple of thoughts.

debrief / pepys-import

Extend extraction highlighter to record extractions in database #263