https://github.com/omnivore-app/omnivore/issues/2316
If you want to fix this, basically, upload everything. Wait for everything to become stable, then run the "update time" bit again.
Perhaps we should check if it's populated, using the articleSavingRequest query, before continuing. Instead of going by batches, we should use a heterogeneous queue of events: first try to populate some articles, then run queries to check whether they are populated, and once they are, archive them and change the date.
In a kind of pseudo-code:
pq = processQueue()

fun saveArticle(url):
    r = gql(...)
    pq.push(checkArticle, r['id'])

fun checkArticle(id):
    r = gql(...)
    if is_populated(r):
        pq.push(archiveArticle, id)
    else:
        # Do some kind of sleep or wait
        pq.push(checkArticle, id)

fun archiveArticle(id):
    gql(...)
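A minimal runnable sketch of that queue in Python; gql_save, gql_status and gql_archive are hypothetical wrappers around the actual GraphQL mutations/queries, not the real API:

import queue
import time

pq = queue.Queue()

def save_article(url):
    # Hypothetical wrapper around the save mutation; returns the request id
    request_id = gql_save(url)
    pq.put((check_article, request_id))

def check_article(request_id):
    # Hypothetical wrapper around the articleSavingRequest query
    status = gql_status(request_id)
    if status != "PROCESSING":
        pq.put((archive_article, request_id))
    else:
        time.sleep(5)  # crude wait before re-checking
        pq.put((check_article, request_id))

def archive_article(request_id):
    gql_archive(request_id)  # hypothetical wrapper around the archive mutation

def run(urls):
    for url in urls:
        pq.put((save_article, url))
    # Drain the queue; tasks enqueue their own follow-up tasks
    while not pq.empty():
        task, arg = pq.get()
        task(arg)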
Yeah, that sounds like a good approach. The batching was to try to speed things up - but given the processing time it doesn’t seem to help all that much.
Before I realised what was happening with the processing, I tried to batch 10, then update each. My only worry (and I haven't checked the code for this) is that I noticed some articles never leave the processing state, so we'd need some escape hatch in case one gets fully stuck.
We can assume that they eventually leave it, perhaps after a few days. We should use the DB to check whether we already tried to archive an article and avoid re-populating it (basically, saving the articleSavingRequest status in the DB).
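For example, a sketch with SQLite (the table layout here is an assumption for illustration, not the actual script's schema):

import sqlite3

db = sqlite3.connect("pocket_import.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS articles (
        url TEXT PRIMARY KEY,
        request_id TEXT,
        status TEXT DEFAULT NULL  -- last known articleSavingRequest status
    )
""")

def needs_work(url):
    # True if we never saved this article, or it was still processing
    row = db.execute("SELECT status FROM articles WHERE url = ?", (url,)).fetchone()
    return row is None or row[0] is None or row[0] == "PROCESSING"

def record_status(url, request_id, status):
    db.execute(
        "INSERT INTO articles (url, request_id, status) VALUES (?, ?, ?) "
        "ON CONFLICT(url) DO UPDATE SET request_id = excluded.request_id, "
        "status = excluded.status",
        (url, request_id, status),
    )
    db.commit()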
Then, after every article has had a populate attempt, make queries to check its status, retrying with @backoff. Making it in two stages is better because it is "resumable": you could always stop the notebook and run it again in a few days, and as long as the database is the same it should avoid re-populating all the articles again.
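Something along these lines, using the backoff package's on_predicate to keep polling while the request is still PROCESSING (gql_status is again a hypothetical wrapper):

import backoff

# Retry with exponential backoff while the status is still PROCESSING,
# giving up after 10 minutes so a stuck article cannot block the run.
@backoff.on_predicate(
    backoff.expo,
    lambda status: status == "PROCESSING",
    max_time=600,
)
def poll_status(request_id):
    return gql_status(request_id)  # hypothetical articleSavingRequest wrapper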
I tried using a threadpool, but the main problem is that I don't know how to "enqueue" the follow-up tasks of checking and updating the article.
Both ThreadPool and ThreadPoolExecutor are meant for submitting a lot of homogeneous tasks (with something like map), processing those results, and then issuing more homogeneous tasks. I don't think what I want is possible with this method.
All the example code looks like this:
futures = ...
for completed_future in as_completed(futures):
    ...
This is not suitable for our case, because we don't want to wait for ALL the articles to be processed before we start updating the info...
The only solution I can think of is a loop like this:
<save all articles and get ids>
remaining = [ ... ]
while remaining:
    <async check and update the remaining list>

I'd like it to be able to interleave the tasks, but I guess there's no other solution.
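A sketch of that two-stage loop with a ThreadPoolExecutor; save_one and check_one are hypothetical helpers (save_one saves an article and returns its request id, check_one returns True once the article is populated, archived and re-dated):

from concurrent.futures import ThreadPoolExecutor
import time

def process_all(urls):
    with ThreadPoolExecutor(max_workers=8) as ex:
        # Stage 1: save every article and collect the request ids
        remaining = list(ex.map(save_one, urls))
        # Stage 2: re-check in rounds until nothing is left pending
        while remaining:
            done_flags = list(ex.map(check_one, remaining))
            remaining = [rid for rid, done in zip(remaining, done_flags) if not done]
            if remaining:
                time.sleep(30)  # give the backend time before the next round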
In the branch associated with this commit you can find a working version that checks when the article is processed and retries if the data has changed (not resumable, as I haven't done the DB connection yet).
When I go to my inbox after running the script, all posts appear as recently added. The original date from Pocket is not saved.
Tasks:
- Handle articles stuck in the PROCESSING state
- Save the articleSavingRequest status in the DB (by default it should be Null)
- Use the stored articleSavingRequest status to update the DB