StampyAI / alignment-research-dataset

Stampy's copy of Alignment Research Dataset scraper
https://huggingface.co/datasets/StampyAI/alignment-research-dataset
MIT License

Article checker #182

Closed by mruwnik 11 months ago

mruwnik commented 1 year ago

Every 4h, it picks 100 articles that haven't been checked in the last 4 weeks and checks:
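
A rough sketch of how that selection step could look, assuming a SQLAlchemy session and an Article model with a date_checked column (the names here are assumptions, not necessarily the project's actual schema):

```python
from datetime import datetime, timedelta

from sqlalchemy import select


def articles_to_check(session, Article, limit=100, max_age_weeks=4):
    """Pick a batch of articles whose last check is older than `max_age_weeks`.

    `session` is an open SQLAlchemy session and `Article` the ORM model;
    both are stand-ins for whatever the real job uses.
    """
    cutoff = datetime.utcnow() - timedelta(weeks=max_age_weeks)
    return session.scalars(
        select(Article)
        .where(Article.date_checked < cutoff)
        .order_by(Article.date_checked)  # oldest checks first
        .limit(limit)
    ).all()
```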

Thomas-Lemoine commented 1 year ago

If my understanding is right, validation's purpose is: 1) for urls that have parsers, or that link to a pdf or an epub, it updates them somewhat; 2) regardless of 1), it checks that the url works and updates the article's status accordingly; 3) it updates date_checked.

It seems to me that 1) and 2+3 are pretty different tasks? Would it maybe make more sense for the validator file to have the following purpose: it goes through all sources and hits some of the links to see if they work. If they don't, it sets article.status to "Unreachable url", but updates date_checked in any case. It could have a name like check_article_url, to indicate that it's only really checking that the url hasn't gone away since it was originally fetched.
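
A minimal sketch of what such a check_article_url could look like, using requests; the "Unreachable url" status and date_checked follow the wording above, while the exact model attributes are assumptions:

```python
from datetime import datetime

import requests


def check_article_url(article, timeout=30):
    """Only verify the url still resolves; touch nothing but status and date_checked."""
    try:
        response = requests.head(article.url, timeout=timeout, allow_redirects=True)
        reachable = response.status_code < 400
    except requests.RequestException:
        reachable = False

    if not reachable:
        article.status = "Unreachable url"
    article.date_checked = datetime.utcnow()
    return reachable
```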

And then, we could just refetch parsers + pdfs + epubs more often?

In other words, validator.py would essentially validate that urls work, whereas the regular item_metadata fetching would update some of the article's fields if needed. In that case, validator would probably make even more sense in common/, but idk.

It's possible that you preferred parsable urls, pdfs, and epubs being updated batch by batch (some every 4 hours) rather than being updated during the regular fetch every week or so. If so, I'm curious why.

mruwnik commented 1 year ago

Regular fetching more often won't work, as the regular fetchers only get new data - they ignore things they've already seen. The basic assumption is that things won't change that often, but it's worth checking every now and then whether there are updates. It would be possible to always recheck all urls, but that would take a LOT more time than the current process of only looking for new items. Ideally, I'd like to move away from there being multiple ways of fetching data and have the list of parsers also contain things that are currently handled elsewhere (e.g. parsers for LW). This is also a mechanism that would sorta-automatically fix any articles that previously couldn't be parsed, e.g. because they didn't have a domain handler.
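
For context, the "only get new data" behaviour amounts to something like this simplified sketch (not the actual fetcher code; dataset.fetch_all and known_urls are stand-ins):

```python
def fetch_new_items(dataset, known_urls):
    """Yield only items the database hasn't stored yet."""
    for item in dataset.fetch_all():
        if item["url"] in known_urls:
            # Already seen: skipped entirely, so it never gets re-parsed.
            # This is why a separate periodic checker is needed for updates.
            continue
        yield item
```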

Thomas-Lemoine commented 1 year ago

Hmmm, I see. Would something like a --rebuild option that re-fetches everything work?

mruwnik commented 1 year ago

To a certain extent. It would be fine with smaller datasets, which can take a few seconds or minutes to run, but the larger ones (e.g. LW, arxiv) can take hours, which is why I wanted to have a hundred or so checked every few hours. An issue with the --rebuild option is that datasets would always have to do upserts, rather than inserts. Though that actually might be a good thing?
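
For MySQL via SQLAlchemy, the upsert itself is fairly mechanical; a sketch, with table and column names assumed:

```python
from sqlalchemy.dialects.mysql import insert


def upsert_article(session, articles_table, values):
    """Insert a new row, or update the mutable fields if the key already exists."""
    stmt = insert(articles_table).values(**values)
    stmt = stmt.on_duplicate_key_update(
        text=stmt.inserted.text,
        status=stmt.inserted.status,
        date_checked=stmt.inserted.date_checked,
    )
    session.execute(stmt)
```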

Thomas-Lemoine commented 1 year ago

> To a certain extent. It would be fine with smaller datasets, which can take a few seconds or minutes to run, but the larger ones (e.g. LW, arxiv) can take hours, which is why I wanted to have a hundred or so checked every few hours.

I see. I'm completely on board with updating in manageable batches. Suppose we call updating_article the act of making a single article more up-to-date (which currently only works for parsers + pdfs + epubs), and validating_url the act of fetching a url, checking its status code to see if it works (modifying .status accordingly), and updating date_checked. Then I have a slight preference for separating them since, as far as I understand, this would save some (maybe significant? unsure) fetching time and give us control over what we do. This is an example of what I have in mind:
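
A minimal sketch of the kind of split being described, where update_article, check_article_url and the per-source intervals are all assumed names and numbers rather than the project's actual API:

```python
from datetime import timedelta

# Assumed schedule: how often each source's content is re-fetched (None = never).
UPDATE_INTERVALS = {
    "aisafety.info": timedelta(days=1),
    "arxiv": None,
    "parsable_urls": timedelta(weeks=4),
}
CHECK_INTERVAL = timedelta(weeks=4)  # url reachability check, applied to all sources


def run_periodic_jobs(articles, now):
    """Run the two separated tasks; both helpers (and date_updated) are hypothetical."""
    for article in articles:
        if now - article.date_checked > CHECK_INTERVAL:
            check_article_url(article)  # touches status + date_checked only
        interval = UPDATE_INTERVALS.get(article.source)
        if interval and now - article.date_updated > interval:
            update_article(article)  # re-parses pdf/epub/html content
```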

This way, datasets that get very little marginal value from being updated (like arxiv; there are a lot of articles and very few are ever meaningfully modified) can be updated very rarely or never, whereas datasets that get a lot of value from updates (aisafety.info especially) can be updated every day or so. Finally, datasets that can be updated url-by-url, which has the advantage of being updateable using only mysql's url data (parsable urls), can be updated in small batches, which spreads out the work.

Maybe this makes it needlessly more complex, though. It also kind of implies a separate "date_updated" for update_article alongside the existing "date_checked" for check_article, but I'm not sure; one advantage of combining the two tasks is that those two dates stay combined.

> An issue with the --rebuild option is that datasets would always have to do upserts, rather than inserts. Though that actually might be a good thing?

I don't know for sure, but yeah, in my mind upserts are always better, I think? We keep old information if it's not replaced by something new, and any new information is assumed to be better than the original. One potential risk is that some url has its text replaced with "This article is no longer available" or something, which would be a bummer. This relates to the fact that, in the validator code, I'd have a slight preference for the heuristic being something more like if len(new_text) <= 0.5 * len(old_text): skip, otherwise update to new_text; an update that prunes or condenses a few paragraphs seems as worthy of replacing the old text as anything.
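
That heuristic is small enough to spell out; a sketch using the 0.5 threshold suggested above:

```python
def should_replace_text(old_text, new_text, min_ratio=0.5):
    """Guard against replacing an article with a stub like "This article is no longer available"."""
    if not new_text:
        return False
    if old_text and len(new_text) <= min_ratio * len(old_text):
        return False  # suspiciously large shrink - keep the old text
    return True
```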

mruwnik commented 11 months ago

I considered adding special handling for that, but decided against it - the entry won't be deleted, it'll just have its status changed. So it should be possible to then just check items with that error and remove the error status, at which point they should be reindexed. Of course, if this starts being an issue it'll have to be addressed. There isn't a mysql trigger, but the main updater task will add not-yet-indexed items to pinecone and remove items that are in it but have error statuses.
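
A rough sketch of that updater step, assuming a pinecone Index handle, an embed helper, a set of already-indexed ids, and that a non-empty status marks an error (all assumptions here, not the project's actual code):

```python
def sync_pinecone(index, articles, embed, indexed_ids):
    """Add not-yet-indexed items; remove indexed items that now carry an error status."""
    # Drop anything that is in the index but has an error status.
    stale_ids = [a.id for a in articles if a.status is not None and a.id in indexed_ids]
    if stale_ids:
        index.delete(ids=stale_ids)

    # Add healthy items that aren't in the index yet.
    for article in articles:
        if article.status is None and article.id not in indexed_ids:
            index.upsert(
                vectors=[(article.id, embed(article.text), {"url": article.url})]
            )
```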