daviddavo / pocket2omnivore

A Jupyter Notebook to upload your articles to omnivore
https://blog.ddavo.me/posts/tutorials/pocket-to-omnivore/
MIT License

publishedAt not working #2

Closed · daviddavo closed this 1 year ago

daviddavo commented 1 year ago

When I go to my inbox after running the script, all posts appear as recently added. The original date from Pocket is not saved.

Podginator commented 1 year ago

https://github.com/omnivore-app/omnivore/issues/2316

If you want to fix this, basically, upload everything. Wait for everything to become stable, then run the "update time" bit again.
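
In outline, something like this (a rough sketch; saveArticle, waitUntilAllProcessed and updateSavedAt are hypothetical helpers around the GraphQL calls, not the notebook's actual API):

ids = [saveArticle(url) for url in pocket_urls]  # 1. upload everything
waitUntilAllProcessed(ids)                       # 2. wait for things to settle
for id, ts in zip(ids, pocket_timestamps):       # 3. re-run the "update time" bit
    updateSavedAt(id, ts)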

daviddavo commented 1 year ago

Perhaps we should check if it's populated before continuing with the articleSavingRequest query. Instead of going by batches, we should use a heterogeneous queue of events: first try to populate some articles, then run queries to check whether they are populated, and once they are, archive them and change the dates.

In a kind of pseudo-code:

from collections import deque

pq = deque()  # the process queue: (task, argument) pairs

def saveArticle(url):
    r = gql(...)  # save mutation
    pq.append((checkArticle, r['id']))

def checkArticle(id):
    r = gql(...)  # articleSavingRequest query
    if is_populated(r):
        pq.append((archiveArticle, id))
    else:
        # do some kind of sleep or wait, then re-check
        pq.append((checkArticle, id))

def archiveArticle(id):
    gql(...)  # archive mutation
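
Draining the queue is then just (a sketch, with urls being the list from the Pocket export):

for url in urls:
    saveArticle(url)

while pq:
    task, arg = pq.popleft()
    task(arg)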

Podginator commented 1 year ago

Yeah, that sounds like a good approach. The batching was to try to speed things up - but given the processing time it doesn’t seem to help all that much.

Before I realised what was happening with the processing, I tried to batch 10 and then update each. My only worry (and I haven't checked the code for this) is that I noticed some articles never leave the processing state, so we'd need some escape hatch if one gets fully stuck.
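
Something like a retry cap on checkArticle would do as an escape hatch (a sketch; the limit is arbitrary):

MAX_RETRIES = 10
retries = {}

def checkArticle(id):
    r = gql(...)  # articleSavingRequest query
    if is_populated(r):
        pq.append((archiveArticle, id))
    elif retries.get(id, 0) < MAX_RETRIES:
        retries[id] = retries.get(id, 0) + 1
        pq.append((checkArticle, id))
    else:
        print(f'Giving up on {id}: stuck in processing')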

daviddavo commented 1 year ago

We can assume that they eventually leave, perhaps after a few days. We should use the DB to check whether we already tried to archive an article and avoid re-populating it (basically, saving the articleSavingRequest status in the DB).

Then, after we have tried to populate every article, make queries to check their status and, for the populated ones, archive them and update the dates.

Doing it in two stages is better because it's "resumable": you could always stop the notebook and run it again in a few days, and as long as the database is the same it should avoid re-populating all the articles.
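
A minimal sketch of the DB part with sqlite3 (table and status names are illustrative, not necessarily what the notebook will use):

import sqlite3

db = sqlite3.connect('pocket2omnivore.db')
db.execute("""
    CREATE TABLE IF NOT EXISTS articles (
        url TEXT PRIMARY KEY,
        request_id TEXT,
        status TEXT  -- last known articleSavingRequest status
    )
""")

def needs_populating(url):
    row = db.execute('SELECT status FROM articles WHERE url = ?', (url,)).fetchone()
    return row is None or row[0] == 'PROCESSING'  # assuming PROCESSING is the pending status

Skipping everything with a terminal status on a second run is what would make it resumable.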

daviddavo commented 1 year ago

I tried using a threadpool, but the main problem is that I don't know how to "enqueue" the follow-up tasks of checking and updating the article.

Both ThreadPool and ThreadPoolExecutor are meant for submitting a lot of homogeneous tasks (with something like map), processing those results, and only then issuing more homogeneous tasks. I don't think what I want is possible with this method.

All the example code looks like this:

futures = ...

done, _ = wait(futures)  # wait() blocks until ALL futures are done by default
for completed_future in done:
  ...

This is not suitable for our case, because we don't want to wait for ALL ARTICLES to be processed before we start updating the info...

The only solution I can think of is a loop like this:

<save all articles and get the ids>

remaining = [ ... ]
while remaining:
  < async check and update remaining list >

I'd like it to be able to interleave the tasks, but I guess there's no other solution.
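
Fleshing that loop out (a sketch with ThreadPoolExecutor; here saveArticle is assumed to return the request id, and check_and_update is a hypothetical function that returns True once the article is populated, archived and its date updated):

from concurrent.futures import ThreadPoolExecutor
import time

with ThreadPoolExecutor(max_workers=8) as pool:
    # <save all articles and get the ids>
    remaining = list(pool.map(saveArticle, urls))

    # <async check and update remaining list>
    while remaining:
        done_flags = list(pool.map(check_and_update, remaining))
        remaining = [i for i, done in zip(remaining, done_flags) if not done]
        if remaining:
            time.sleep(5)  # give Omnivore some time before the next round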

daviddavo commented 1 year ago

In the branch associated with this commit you can find a working version that checks when the article has been processed and retries if the data has changed (it's not resumable yet, as I haven't done the DB connection).