robintw closed this issue 2 years ago
Does this ring a bell @robintw ?
Yes, I remember it - but I must say I don't necessarily think it is needed.
The aim of this was to record every single token that has had data extracted from it in the database - but this would be a huge volume of data to record for every file (especially the very large ones) - and would also require some fairly significant changes to the code that deals with token extraction and highlighting.
I'm not convinced it would be worth the extra time taken on each run (which could be significant, as there could be 100,000s of rows of data to add), the storage space, or the time to implement - but you may have different views. I think the pros/cons would definitely need discussing with the clients before starting to implement.
Yes - good points. One power user (E.B) has a vision of being able to reverse-engineer emergent data issues back to their source.
So, someone finds an issue with some data, and they are able to track back to the line of text it came from. This may be more relevant in the narrative-text based processing, and wanting to see the source data in order to verify if there was a typo, or the data got mangled during import.
This feature could be optional for different file types. We should include it as a constructor option, and in the exploitation template.
CF:
Do you want me to start working on this now @IanMayo? I see it's in the 'Ready to Go' section.
It'll be a reasonably-sized piece of work, so I won't be able to complete it before I go on holiday next Wed.
We'll also need to think about how to structure it: we want to have a way of turning it on and off for different file types, or even individual imports. We'll also need to work out how all the various bits will interact: we'll need to do the highlighting as we parse the file, but then store all the highlighted info to be able to export into the database once the file has been successfully imported (and not import it if there were errors that prevented the file being successfully imported).
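That parse-then-commit structure could be sketched roughly like this (a minimal illustration only; the class and method names here are hypothetical, not existing code): collect extraction records in memory while parsing and highlighting, and only hand them over for database insertion once the file has imported successfully.

```python
class ExtractionBuffer:
    """Collects extraction records while a file is parsed/highlighted.

    Hypothetical sketch: records are only handed over for database
    insertion if the whole file imported successfully; otherwise they
    are discarded along with the rest of the failed import.
    """

    def __init__(self):
        self.records = []

    def record(self, table, field, chars):
        # Called from the highlighting code as each token is extracted
        self.records.append({"table": table, "field": field, "chars": chars})

    def flush(self, import_succeeded):
        if not import_succeeded:
            # File failed validation/import: discard, write nothing
            self.records = []
            return []
        persisted, self.records = self.records, []
        return persisted  # real code would insert these rows into the DB here

buf = ExtractionBuffer()
buf.record("States", "time", "2020-01-05 10:30:23")
rows = buf.flush(import_succeeded=True)
```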
Personally, I'm not 100% sure this should be a very high priority at the moment - but I'm definitely the right person to work on it, as I'm the most experienced with the highlighting/tokenising code.
Ah, just realised this was in the section for 'backlog for next sprint' not 'target for this sprint' - I'll chat to you on Slack about what to focus on next.
This looks like it'll be a bit easier than expected - as we already have some of the infrastructure needed. The key question now is what we do about the table schema for storing this in the database. Currently the schema is defined as:
from datetime import datetime
from uuid import uuid4

from sqlalchemy import Column, DateTime, String
from sqlalchemy_utils import UUIDType

# BaseSpatiaLite, TableTypes and constants come from the project codebase

class Extraction(BaseSpatiaLite):
    __tablename__ = constants.EXTRACTION
    table_type = TableTypes.METADATA
    table_type_id = 10

    extraction_id = Column(UUIDType, primary_key=True, default=uuid4)
    table = Column(String(150), nullable=False)
    field = Column(String(150), nullable=False)
    chars = Column(String(150), nullable=False)
    created_date = Column(DateTime, default=datetime.utcnow)
If we imagine we're recording that we've extracted a timestamp from a CSV file which is giving us State information, presumably the fields would be set as follows:
table: States
field: time
chars: "2020-01-05 10:30:23"

However, this doesn't record which entry in the States table this data was used for (ie. the state_id of the State this was used to create), and doesn't record where in the file the characters came from. Do you think we should extend the API to cover this?
We could do something like:
table: States
entry_id: abcd1234 (this is an awful name, but I don't know what to call it)
field: time
text: "2020-01-05 10:30:23"
text_location: ? (see below)

However, I'm not sure how to deal with the location. Most of the time it will be a single range of characters (eg. from char 67 to char 89), but when we combine tokens we can end up with separate character ranges - eg. 67-89 and 101-118 - and of course, we could have any number of these ranges. I'm not sure how best we could store these in the database. We could do some sort of complicated relational design, but I think that is overkill. I'm leaning towards just storing it as a string of something like 67-89, 101-118, 205-209 - that would be fairly easy to parse automatically if we needed to (just split on , and then on -). What do you think?
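For reference, the comma/hyphen string really is trivial to parse; a minimal sketch (the function name here is just for illustration):

```python
from typing import List, Tuple

def parse_ranges(text_location: str) -> List[Tuple[int, int]]:
    """Parse a location string like '67-89, 101-118, 205-209'
    into a list of (start, end) character-index pairs."""
    ranges = []
    for part in text_location.split(","):
        start, end = part.strip().split("-")
        ranges.append((int(start), int(end)))
    return ranges

print(parse_ranges("67-89, 101-118, 205-209"))
# [(67, 89), (101, 118), (205, 209)]
```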
Ah, hang on - we also need to link to the Datafile entry so we know what datafile this came from. So we should add a datafile_id column too.
Yes - the text summary of the ranges should be satisfactory.
I think there is merit in recording what the text was interpreted as. It is possible for software to combine the current value of the field with the items listed above. But, it would be easier to inspect if we put the parsed value in here too.
Ok, to make it work across different types, it would have to be the string value of what was parsed - but that should still show us that parsing and unit assigning went ok (eg. it would say 34.5 degrees etc.)
Yes - I agree.
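Putting all of the agreed fields together, the extended row might carry something along these lines. This is shown as a plain dataclass purely to illustrate the fields - the real model would be a BaseSpatiaLite/SQLAlchemy class - and entry_id and parsed_value are just working names from this thread, not final column names:

```python
from dataclasses import dataclass
from typing import Optional
from uuid import UUID, uuid4

@dataclass
class ExtractionRow:
    """Illustration of the proposed columns - not a final schema."""
    table: str              # e.g. "States" - table the data went into
    entry_id: UUID          # primary key of the created row (e.g. state_id)
    field: str              # e.g. "time"
    text: str               # raw characters extracted from the file
    parsed_value: str       # string form of the interpreted value, e.g. "34.5 degrees"
    text_location: str      # e.g. "67-89, 101-118"
    datafile_id: Optional[UUID] = None  # link back to the Datafile entry

row = ExtractionRow(
    table="States",
    entry_id=uuid4(),
    field="time",
    text="2020-01-05 10:30:23",
    parsed_value="2020-01-05 10:30:23",
    text_location="67-89",
    datafile_id=uuid4(),
)
```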
Hmm, now I'm starting to implement this, I'm running into some issues.
When recording the extraction to the database, we need to know the Datafile that the extraction came from, and the entry_id and table that the data is going into. We currently don't have access to the Datafile in the highlighting code, but that is fairly easy to change. What is more difficult is that we won't know the entry_id at the time that we do the recording of the token.
This is for two reasons:
1. At the point where we record the token, the State object hasn't yet been created - that happens later, inside the create_state method.
2. The state_id field (the primary key) doesn't get populated until we save it to the database. We don't do this until a lot later on, after we've run all importers for the file, run the validation and decided whether we're going to import the measurements etc.

I've been wondering whether we could try and link the State objects back to the Extractions at a later stage - so we have a big list of extractions, and then link those to the States once the States have been pushed to the database. That might work, but would be pretty complex and would require quite a lot of fairly major changes.
What do you think we should do @IanMayo? I think there's little point in recording the extractions if we can't link them to specific State objects, but I think it will be very difficult to do that.
Oh dear - you're right.
Since every measurement has a timestamp, we could store the timestamp. Maybe datafile and timestamp could do? Maybe not.
This isn't python/pythonic, but here is one solution.
When we parse a measurement, we provide it with a callback function. This function takes a measurement id parameter. The callback has a copy of the extraction it represents.
I presume we load data into an intermediate format. For each item in this listing, we give it the callback function. When we generate the state, we then take the state-id, and call the callback function with it.
Aah, I guess we could add the callback to the create_state function. If it's non-null, then it's called once the state_id is known.
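That callback idea could look roughly like this (a minimal sketch only; Importer, create_state and the commit step here are simplified stand-ins for the real importer/database code, not its actual API):

```python
import uuid

class Importer:
    def __init__(self):
        self._pending = []  # (state, callback) pairs awaiting a state_id

    def create_state(self, data, extraction_callback=None):
        # state_id is unknown until the database save much later
        state = {"data": data, "state_id": None}
        if extraction_callback is not None:
            self._pending.append((state, extraction_callback))
        return state

    def commit(self):
        # Simulates the later database save, where primary keys get assigned
        for state, callback in self._pending:
            state["state_id"] = uuid.uuid4()
            callback(state["state_id"])  # extraction can now link to its State
        self._pending.clear()

imp = Importer()
linked = []
imp.create_state("row 1", extraction_callback=linked.append)
imp.commit()  # now linked[0] holds the state_id for that extraction
```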
Just a couple of thoughts.
Currently the extraction highlighter code (the record methods on Line and Token) records the usages internally to produce a highlighted HTML output. We should extend this so that they can also record extractions in the database.
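One way that extension might look, without disturbing the existing HTML output (a sketch only - this Token class and the optional recorder argument are assumptions about how the real highlighter API could be extended, not its current shape):

```python
class Token:
    def __init__(self, text, start, end):
        self.text, self.start, self.end = text, start, end
        self.usages = []  # existing behaviour: tracked for HTML highlighting

    def record(self, table, field, value, recorder=None):
        usage = {"table": table, "field": field, "value": value,
                 "location": f"{self.start}-{self.end}"}
        self.usages.append(usage)   # existing: feeds the highlighted HTML
        if recorder is not None:    # new: optionally persist the extraction
            recorder(usage)

db_rows = []
tok = Token("10:30:23", 67, 75)
tok.record("States", "time", "10:30:23", recorder=db_rows.append)
```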