alephdata / memorious

Lightweight web scraping toolkit for documents and structured data.
https://docs.alephdata.org/developers/memorious
MIT License
311 stars 59 forks

Different behaviour between FakeRedis and real redis #144

Closed simonwoerpel closed 3 years ago

simonwoerpel commented 3 years ago

Hey,

I encountered an issue when writing a recursive crawler. By recursive I mean that the parse stage feeds newly discovered URLs back into the fetch stage.

The config section looks like this:

```yaml
  fetch:
    method: fetch
    handle:
      pass: parse

  parse:
    method: parse
    params:
      store:
        mime_group: documents
      include_paths:
        - ".//div[@class='artikel']"  # find urls for 1st iteration
        - ".//div[@class='archiveArticleInfo']/ul/li[1]"  # find urls for 2nd iteration
        - ".//div[@id='buttons']/div[@class='save']"  # find urls for 3rd iteration (these are never emitted in debug mode, but in deployed mode)
    handle:
      fetch: fetch
      store: store
```

In this scenario, memorious in debug mode (via `memorious run my_crawler`) never fetches the URLs from the third iteration, but when built and run via Docker, it does.

Of course it would be great if crawler execution behaved exactly the same way in local development :upside_down_face:

@pudo and I had a short discussion about this and suspect it has something to do with fakeredis vs. "real" redis: the data dictionary is sometimes altered in place in the code, which has different side effects when using fakeredis than when using real redis...
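To illustrate the suspected mechanism, here is a minimal sketch (not memorious code; the queue and function names are made up): with an in-process store like fakeredis, the same dict object can be handed from stage to stage by reference, so an in-place mutation in one stage leaks back into the original payload. With real redis, the payload is serialized on the way through the queue, so each stage works on its own copy.

```python
import json

def run_in_process(data, mutate):
    # fakeredis-style: the object is passed by reference within one process
    queue = [data]
    task = queue.pop()
    mutate(task)  # side effect leaks back into `data`
    return data

def run_serialized(data, mutate):
    # real-redis-style: the payload is serialized into the queue
    queue = [json.dumps(data)]
    task = json.loads(queue.pop())
    mutate(task)  # mutation stays local to the deserialized copy
    return data

def mark_seen(d):
    d["seen"] = True

print(run_in_process({"url": "a"}, mark_seen))  # {'url': 'a', 'seen': True}
print(run_serialized({"url": "a"}, mark_seen))  # {'url': 'a'}
```

If this is the cause, the same crawler can legitimately emit different URLs in debug mode vs. deployed mode, because the in-place mutation is only visible to other stages in the in-process case.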

I already tried adding a `data = data.copy()` before this line: https://github.com/alephdata/memorious/blob/master/memorious/operations/parse.py#L58 but this doesn't help.
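One possible reason the `data.copy()` didn't help, sketched below with a hypothetical payload shape: `dict.copy()` is shallow, so any nested dicts or lists inside `data` are still shared with the original object, and mutating them through the copy still alters the original. `copy.deepcopy` would be needed to fully decouple the two.

```python
import copy

data = {"url": "a", "meta": {"depth": 1}}

shallow = data.copy()           # copies only the top-level dict
shallow["meta"]["depth"] = 2    # nested dict is still shared with `data`

deep = copy.deepcopy(data)      # duplicates nested structures as well
deep["meta"]["depth"] = 3       # leaves `data` untouched

print(data["meta"]["depth"])    # 2 — the shallow copy did not protect `meta`
```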

I am sure someone who knows the memorious codebase better (looking at you @sunu :joy:) can point me in the right direction on how to fix this...

sunu commented 3 years ago

@simonwoerpel Can you check if the latest version fixes the issue for you?

sunu commented 3 years ago

Hoping the fix worked. Feel free to reopen otherwise.