Decode and deduplicate tags in the catalog with a TargetedReingestionDAG

WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.

https://openverse.org

MIT License

254 stars 202 forks source link

Decode and deduplicate tags in the catalog with a TargetedReingestionDAG #4452

Open krysal opened 5 months ago

krysal commented 5 months ago

Problem

In #4143, @obulat proposed to add new cleaning steps to the fix tags in the Catalog, but the option of including them in the Ingestion Server was declined in favor of using the batched_update DAG.

It is important to fix the encoding quickly because it can cause a gray Nuxt error screen for pages that contain tags with character sequences that cannot be URI-encoded (such as udadd from "ciudaddelassiencias").

Description

We want to take the functions that were planned to include in said PR and translate them into parameters for this DAG. Given the complexity of the decoding transformation it might require some advanced functions of PostgreSQL, like a combination of pattern matching and PL/Python Functions.

In #1566, duplicated tags were previously removed, so we will apply the same solution given the decoding may cause new duplicates.

Additional context

Related to #4125 and #4199 (similar issue).

sarayourfriend commented 4 months ago

I tried to run the DAG again today, and it turns out RDS does not support the PL/Pythonu extension :disappointed:

RDS does support plperl, plrust, plpgsql, and plv8. Of those, plv8 (v8 being the same JS engine as Chrome), might be the most proximate for this use case, but I'll see. It might be that the reingestion of these records is the easiest, most reliable way to do it.

sarayourfriend commented 4 months ago

@stacimc can you chime in on whether you think it's okay to go ahead and start thinking about how we would reingest these specific records identified as have potential unicode escape literals?

I think that's the safest approach anyway, because even an escaped literal unicode sequence could be valid metadata from the source. So even escaped literal sequences might be wrong to assume were caused by some previous bad transformation on our end. Most of them certainly were, we know there's some historical occurrence of that, but if even one isn't, then we've just mangled that data with no easy way to get it back (other than reingesting it anyway). And the only way to be sure will be to reingest those records.

Can you give any hints or tips of things you think need to be looked out for when thinking about this? The general idea I shared in the initial PR is here, for reference: https://github.com/WordPress/openverse/pull/4475#issuecomment-2167203047

Basically: I don't think a DAG like the one I merged in that PR is actually safe. This isn't a safe or reliable thing to infer based on our data alone. Therefore, to fix this, we need to reingest the records.

A temporary fix for instances like https://openverse.org/image/3e3bc924-a7c0-4a6b-bcea-4a17160355d2 is to only apply the frontend's automated fix for works with indexation dates before a certain date (lets say 2021-01-01, based on Olga's query in this comment).

stacimc commented 4 months ago

@sarayourfriend if the only way to safely determine what the tags should be is to reingest, then I think that's fine.

The Flickr reingestion DAG is absolutely not reliable for fixing this for a variety of reasons, but I do think we could implement something like you've described using a lot of existing code from the ProviderDataIngester and the loader tasks.

Actually I would look into extending the existing FlickDataIngester. You can add create_preingestion_tasks and create_postingestion_tasks hooks (see iNaturalist for an example) to the FlickrDataIngester for creating a temp table where you select the foreign_identifiers of the rows you want to update, in the style of the batched_update DAG, and eventually drop that table. Then modify the new ingester class to select batches of ids from this table and query the individual record endpoint from Flickr. You might need to tinker with the base class a little but I think this should all be doable pretty easily. The advantage is that records all get committed through the MediaStore, we'd get TSVs in S3 just like a regular provider DAG, etc. You'd just register a new DAG in provider_workflows.py with the new ingester class.

The only thing that would be an issue is that the upsert step of our provider DAGs currently merges tags rather than overwriting them 🤔 I brought this up on the linked PR too... I'm interested in changing the upsert strategy for tags to overwrite tags from the provider. Obviously we'd have to be careful to ensure we're still merging tags with a different provider field, so that we don't overwrite machine-generated tags for example, but this honestly feels like a bug as implemented. If we could make that upsert a little more sophisticated then I think this would work pretty simply. @AetherUnbound @WordPress/openverse-catalog any thoughts/objections to that in particular?

AetherUnbound commented 4 months ago

That's a good point about potentially overwriting the tags rather than merging them! I wonder though, if it might actually be better to store all of the tags but (similar to provider and accuracy) add another field (perhaps live or current? or even deleted?) which shows that it's no longer present on the upstream source. That would feel more appropriate to me in-line with our desire to keep as much data in the catalog as possible in the long-term. What do you think?

sarayourfriend commented 4 months ago

Madison, in your version, would we basically update the existing tags where provider == current_provider with "live": False or something like that? And then merge in the new tags?

I don't mind that personally, we should add dates to new tags if we're doing that, I'd think, for provenance's sake. It's certainly the least destructive approach, which is great!

I suppose the main difficulty there is that if we set current or live to False/0/whatever, any query to the tags would now need to be aware of those fields. It's a broader change.

How about instead of keeping those "former" tags in tags, we put them in meta_data under something like a tag_history field, with date-string keys mapped to the list of tags from that date? Thinking from a general perspective of tracking the changes to individual works, that establishes a generic pattern we could follow. I haven't checked to see whether the top-level keys of meta_data could clash with provider data/how we organise the data in that column. If it's possible to have a general history field in meta_data that maps field name to an object of date -> value-at-that-date where date is the date of the value being replaced with the "next" value, it would remove the need to add the overhead (operational more than performance) of checking for a current/live/version field on individual tags and has the additional benefit of being able to track changes to any field on a given work.

We could also have a separate table for this, like image_history, etc, that is just (identifier UUID NOT NULL, data jsonb NOT NULL, date DATE NOT NULL, summary TEXT, ???). At least that way there's no chance of meta_data keys clashing, if that's a potential issue, and it doesn't add any logistical/operational overhead to querying the main tables. The same schema could be replicated in S3 if we didn't need it in the database.

I'm on board to take it slow on these considerations for this issue. I've downgraded the priority to medium from high, I think our current stop-gap is OK, and this kind of data-quality and revision work is worth making sure we get right and don't make destructive mistakes.

stacimc commented 4 months ago

I wonder though, if it might actually be better to store all of the tags but (similar to provider and accuracy) add another field (perhaps live or current? or even deleted?)

I'm not opposed, although that raises some questions for me (for example, should deleted tags be ignored in search? I'd argue that might be a good idea, since it's very likely that if a tag was deleted from the record at source, it wasn't a "good quality" tag. Similarly, should they be displayed on the frontend?).

That would feel more appropriate to me in-line with our desire to keep as much data in the catalog as possible in the long-term.

I also feel like this is a little different from, for example, the machine-generated tags below a certain accuracy threshold. In that case those are all perfectly valid tags, and the issue is that we're making decisions about which ones we want to use/display. This case could be thought of as correcting data that has been invalidated upstream (in the way that when the title or description of a record changes at the source, we simply update that in the catalog). But... that is certainly a nuanced distinction! I certainly have no objection to keeping the data (beyond noting that it'll make the upsert a little more complicated, which is totally fine). Those questions about deleted tags in display/search can also be answered later. There's no regression from current behavior :) And the great thing about keeping the data is you can always change your mind later, which is obviously not the case if we delete it.

As far as a tag_history table or something more complicated, I think we should take a step back and do a full project proposal/etc. For example, immediately I'm questioning if we're going to do that, whether we should also be keeping history for other fields that can change upstream. And if not, why?

sarayourfriend commented 4 months ago

As far as a tag_history table or something more complicated, I think we should take a step back and do a full project proposal/etc. For example, immediately I'm questioning if we're going to do that, whether we should also be keeping history for other fields that can change upstream. And if not, why?

Agreed that this is scope enough for a project. My assumption is that yes we should keep that data, if we can, but I'm also wary of the fact that I actually don't have a good reason why we should, and that without a good reason, we risk spending a considerable amount of money storing data we don't currently have a known use for. You're absolutely right to point out that it's very different in that respect from the machine generated tags where the argument to keep all of them is easier to make.

We also don't need to start a full historical record for all works right away. We could, for example, develop an idea for a generic approach but only use it for this targeted reingestion, and see how it goes, and then implement it more broadly if we decide it suits our purposes (and probably more freely iterate with a smaller amount of pre-existing data than if we started doing this for everything right away).

I suppose a big question is about non-dated provider DAGs? Would we diff every field for every work to see if it needs a new history entry?

stacimc commented 4 months ago

I suppose a big question is about non-dated provider DAGs? Would we diff every field for every work to see if it needs a new history entry?

I think we could use triggers here to insert into some history table when there are updates to the media tables, although I'd have to play around with it -- I know you can create a trigger that runs BEFORE update, not sure off the top of my head what the best way to grab just the changed fields would be. You could have a separate TRIGGER before update for each field... Just not sure about performance implications 🤔

It would be very interesting to see if we can come up with any use case for storing this information. Honestly, I have some reservation about storing historical user-generated data from third party sites -- for example, I wonder if a Flickr user who changed the title/tags/etc on an image would be surprised that Openverse is retaining the old data, and if that could be cause for concern. If we scope a project for this I'd like to see that addressed in the proposal.

Circling back to the decode/deduplicate tags issue, do you think it should block on this? We're potentially talking about a pretty big and as yet unprioritized/unplanned project. One thing we could do, using the approach I outlined in this comment, is to add another preingestion task just to the new DAG that drops at least the tags with potential invalid characters. You could argue that tags that have gotten into an invalid state, can be dropped and replaced with current data from the provider. Then the regular load steps will merge in the current tags just fine.

Alternatively the preingestion task could add a stale field to all existing tags, just for this DAG for now. If we go that route instead of dropping we'd still need to consider the questions I asked earlier about whether to search on/display those.

zackkrida commented 4 months ago

Just wanted to mention that I've implemented this postgres trigger history logging pattern in a project before. It was for a Gutenberg-esque CMS which stored page revision history. Postgres triggers are given access to some useful variables:

TG_OP - The operation on the row, create, update, delete
OLD - The row before the operation is applied
`NEW - The updated row

My project had a generic history table shared between different "post types" so I had columns for the "old" record and the "new" record that could be used to produce a diff. Something like this:

CREATE OR REPLACE FUNCTION log_history() RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'UPDATE' THEN
        INSERT INTO history (id, old, new, operation_type, change_timestamp)
        VALUES (OLD.id, row_to_json(OLD), row_to_json(NEW), 'UPDATE', current_timestamp);
    ELSIF TG_OP = 'DELETE' THEN
        INSERT INTO history (id, old, new, operation_type, change_timestamp)
        VALUES (OLD.id, row_to_json(OLD), NULL, 'DELETE', current_timestamp);
    ELSIF TG_OP = 'INSERT' THEN
        INSERT INTO history (id, old, new, operation_type, change_timestamp)
        VALUES (NEW.id, NULL, row_to_json(NEW), 'INSERT', current_timestamp);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

It worked well! The only thing we'll need to test is performance with large batch updates.

Agreed all around that this would be a standalone "data preservation/archiving" project, though.

sarayourfriend commented 4 months ago

It would be very interesting to see if we can come up with any use case for storing this information. Honestly, I have some reservation about storing historical user-generated data from third party sites -- for example, I wonder if a Flickr user who changed the title/tags/etc on an image would be surprised that Openverse is retaining the old data, and if that could be cause for concern. If we scope a project for this I'd like to see that addressed in the proposal.

This occurred to me as well. There's a certain extent to which we should describe works the way the upstream provider describes them. Retaining changed information for search sounds bad, especially if extrapolated to a GLAM institution where descriptions are much more intentional and changes might be reflective of changing institutional policies that we would not want to misrepresent.

Circling back to the decode/deduplicate tags issue, do you think it should block on this? [...] You could argue that tags that have gotten into an invalid state, can be dropped and replaced with current data from the provider.

The issue with this, to me, is that there is no reliable way to detect tags in an invalid state. At the very least, there's no way to determine whether something is in that state because of a mistake we made with the data in the past, or whether it's something intentional from the upstream provider.

Even when scoped to just ones we "suspect" to have the issue, I don't think there's a way to reliably compare across the new set of tags from the provider and our own to determine which are ones we should drop vs keep. Dropping all the old tags (from search) and just keeping the new ones is the only 100% reliable way to make sure we match the upstream.

Your point about it being a big blocker though, is important. We don't need to block on this so long as we store the old tags in such a way that we could backfill any future broader history-tracking solution we decide on. For example, storing the old tags in S3 with minimal metadata stored in the key (identifier + date?) is a fine temporary solution that would allow us to unblock this specific issue, whilst preserving the ability to backfill whatever history-tracking method we try. If anything, choosing the most generic/abstract version of this (e.g., one that is explicitly unoptimised for anything other than storage) is probably the best first choice given we don't have a precise use case for this data now, and unbinds us from needing to turn this into a big project. So long as whatever we choose is minimally viable (date, identifier, explanatory note/event marker, plus the json blob, I think), we can defer an actual design for this kind of thing to later on.

Would that be alright? If we did that, we'd implement the changes to the DAG you suggested, Staci, in addition to loading the "old" row as JSON into S3 before replacing it altogether in the catalog DB. Create the table based on the query as well as upload the "old" rows in create_preingestion_tasks?

The only thing that would be an issue is that the upsert step of our provider DAGs currently merges tags rather than overwriting them

This seems like the biggest blocker to me, it isn't deferrable like tracking the old data, right? Can we introduce a new strategy that does what you've suggested? It should be possible to filter out tags from old from the provider with something like this, I should think:

jsonb_path_query_array(
    old.tags,
    format('$ ? (@.provider != %s)', old.provider)
)

AetherUnbound commented 3 months ago

These are all really good points, and I think asking the question of what do we hope to get from this data is always a good place to start before we invest time into building out a way to store/track/record it.

I'm on board with the approach of making a minimally viable, storage-optimized approach for unblocking this specific issue. That's similar to the plan we've had with #4610 and the data normalization efforts thus-far. As long as we're preserving the old values somewhere, I think that's best here!

sarayourfriend commented 3 months ago

Okay, I'll wait for @stacimc to reply regarding the tag merging issue she brought up, and then we can start to think about concrete implementation details for the first-pass just to get these initial works reingested.

stacimc commented 3 months ago

@sarayourfriend is my understanding correct that you're suggesting we use the approach I outlined in this comment, but also update the upsert query for all of our provider DAGs to:

store minimal 'old' tag data for rows being updated in json in s3
overwrite rather than merge the old tags with the new tags

No objections from me if so, particularly because our load steps are not a performance bottleneck so I'm not particularly concerned about additional complexity there. This is the best minimal solution that keeps the tag history, IMO.

Although I agree we should retain data even if we are initially skeptical of its usefulness (deleted/modified tags were likely low quality tags, as you point out), the issue to me remains privacy/ethical concerns with deliberately retaining this historical data that is intentionally changed at source. I can't think of a use case for this data that would avoid those issues. That being said I have no blocking objection to storing it in s3, and this solution is even a huge step forward, since it will bring tags in our catalog/API up to date. 👍

sarayourfriend commented 3 months ago

overwrite rather than merge the old tags with the new tags

Yes, for the provider. Something like the expression:

new_provider_tags + [old_tag for old_tag in old_tags if old_tag["provider"] == provider]

The SQL I shared above should be (or is intended to be) equivalent to the right hand side of the expression, but I'm not sure I 100% understand the column strategies. In particular, I don't really understand the EXCLUDED reference. Are those rows the new data? That's the only deduction I can make about them, based on the other columns being old. The term excluded is confusing me and I'm having a hard time wrapping my head around the documentation for it.

store minimal 'old' tag data for rows being updated in json in s3

Your point about the ethics is good. Beyond privacy, we have a responsibility to represent works accurately as informed by the upstream data. Keeping removed or changed tags (as we do now by merging old and new data) already creates a problem there. Storing them in S3 doesn't fix that, and I don't know what would be useful about it other than in the case of needing to recover from a mistake. However, I think the right way to do that would be to track the rows we changed, when and why we changed them, and what method we used to change them, in some sort of provenance log. And if we needed to recover corrupted or incorrect data after this reingestion, we could just reingest them. We have database snapshots anyway, if things were really dire.

We aren't an archive, we do not currently have any mandate (whether internal or external) to archive the history of works from upstream providers. Good thing to keep in mind, and the ethics you've raised are a good prompt for that, for sure.

How about this: we export the table we create of identifiers of works to reingest to S3 (and document it appropriately), such that we could easily track down the rows that were modified by this process? Maybe that's something that should go into the batched update anyway, as a way of tracking the fact of changes (if not the specific changes in every case) to rows from that DAG.

sarayourfriend commented 2 months ago

I need to hand this work off, so pinging @krysal as I guess this is part of the data normalisation work now?

I've worked with @stacimc to create an outline of how to approach targeted ingestion (and therefore reingestion) of works for a provider. The term "targeted" refers to the idea of "targeting a specific set of works" for a given provider. The overall plan is to adapt existing ProviderDataIngesters to pull batches from a table of identifiers to ingest from the provider. This mirrors the batched update DAG, where the batches are ProviderDataIngester batches instead of UPDATE ... calls. The DAG will create the table of identifiers to reingest using the create_preingestion_tasks from a new set of workflow dataclasses, and a select_query DAG configuration identical to the batched update DAG. It should use the same approach as the batched update DAG to create the temporary table for the DAG run, and the same approach for iterating through the table in batches, and marking each row after the batch is finished.

Each batch will be pulled in an override of get_batch of the data ingester. get_batch will consult the table to find the next batch, and pass each foreign identifier to get_response_json. When retrieving each batch, get_batch should mark the batch rows as in progress. For each subsequent batch, the previously in progress rows should be marked as completed, and the next batch pulled and updated as in progress. On the first call, there will be no records in progress. On the final call, there will be no new records to pull and should_continue will be returned as False to halt ingestion. get_batch should also interpret the responses from get_response_json and get_batch_data to identify records that failed to ingest, and then indicate those in the table as well. We can accomplish this with a status enum column, which can either be not_started, in_progress, ingested, or failed (something along those lines).

get_response_json will be adapted in the targeted data ingester to create the correct endpoint and query parameters to pull the single record for foreign identifier.

The targeted data ingester will adapt get_record_data as needed to handle the responses from the provider for a single work rather than the individual list items from the provider's bulk endpoints normally used for ingestion.

The targeted ingester will use create_postingestion_tasks in the workflow base class to handle clean up of the temporary table just like the batched update DAG.

I have started on some of this in the following commit: https://github.com/WordPress/openverse/commit/59d2e4cc3970369400f656b614b75c843eea47cc#diff-20a97d97a930c3375c7d432419db3838b806c384842e41ee9c031debe3953391

The code there is just a sketch, and is messy, and doesn't follow some of the specific things above, as I wrote it to show Staci and work from in our discussion.

Staci's additionally advised that for Flickr in particular, due to the regular ingester being time delineated, it will be easier to understand the provider if we extract the record building into a record-builder class similar to EuropeanaRecordBuilder, "and then have a FlickrProviderIngester which extends the TimeDelineatedProviderIngester and a FlickrTargetedIngester which doesn’t, both using the record builder.".

For additional clarification:

We will not immediately or necessarily ever create targeted ingesters for every provider. We will create them only as needed.
The intention of using create_preingestion_tasks and create_postingestion_tasks in a new workflow is to be able to use the existing DAG factory for regular ingesters. Therefore, we should also add a new ingestion type tag for DAGs that are targeted. If it becomes too complex to use the same DAG factory as regular ingesters, then do not feel tied to using the pre/post ingestion tasks approach, but @stacimc should be able to advise on how best to proceed if that ends up being the case.

@krysal I'm unable to continue working on this myself because it's much more significant work than we originally planned, and I need to focus on some infrastructure things, including the #3336 IP, and some pending security improvements we need to make, and infrastructure planning. In other words, I'm rather spread thin, and as this is on the edge of my typical responsibilities and work on the project, it would be better for me to hand it off to someone, whether that is you or someone else on @WordPress/openverse-catalog.

@stacimc please do correct and/or clarify anything I've shared above as you see fit.

sarayourfriend commented 2 months ago

Oh, I would also add, it might be a good idea to split this into a couple PRs, and maybe use a simpler provider than Flickr to proof-of-concept the idea, before getting too bogged down with Flickr details, if necessary. On the other hand, the complexity of Flickr might be the best way to prove the concept in the first place, but multiple PRs will probably be easier to understand and review nonetheless!

stacimc commented 2 months ago

@sarayourfriend I wouldn't mind working on this, myself, given that I've already been a big part of the discussion! I'm wrapping up the last big (non-test) ingestion server removal PR this week, so as long as it's not critical that work continue on this immediately I'm actually already invested/very interested in this issue. Happy to answer questions if @krysal really wants to take it though!

AetherUnbound commented 2 months ago

I have seen this and will respond to this discussion when I have time!

zackkrida commented 2 months ago

@stacimc I've assigned you to this issue based on your expressed interest! There is no urgency to begin the work until you're able :smile: I've moved it to the backlog as well until you're ready to prioritize it.

stacimc commented 3 weeks ago

Title edited to reflect the new intention for this work, using targeted reingestion.