elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.
https://hexdocs.pm/crawly
Apache License 2.0
971 stars 114 forks source link

robots.txt matching is pretty buggy #254

Closed serpent213 closed 6 months ago

serpent213 commented 1 year ago

@oltarasenko I just noticed you are maintaining your own fork of Gollum. I found its matching algorithm having some severe bugs and also feels hard to reason about.

Just wanted to draw your attention to two pull requests, which introduce additional tests.

Might rewrite the matcher, but first things first.

oltarasenko commented 1 year ago

@serpent213 yeah, I forked the Gollum a few years ago due to the same reason. I had the plan to fix the parsing there but did not have enough time to do that. Unfortunately :(.

fstp commented 1 year ago

Hi,

I would like to pick this task and start improving the parser if that is okay. @serpent213 could you resubmit your pull request to the forked repo from oltarasenko? (https://github.com/elixir-crawly/gollum)

After it has been merged we can start to work on improving the parser in that fork since it seems like the original project is abandoned by now.

oltarasenko commented 1 year ago

@fstp Please let me know if I can help you in any way.

serpent213 commented 1 year ago

@fstp See https://github.com/elixir-crawly/gollum/pull/1.

fstp commented 1 year ago

Thank you for the quick response! :) @oltarasenko if you want you could merge the pull request from @serpent213 and then I can create my pull request based on that. Or is it better that I push directly to the pull request?

serpent213 commented 1 year ago

One of the commits introduces a failing test, IIRC. The Google-port has a lot of stuff tagged as skip to pass.

fstp commented 1 year ago

Ah okay I see, can I clone your pull request somehow and push the code directly there? Would be nice to run test driven development against the failing testcases

serpent213 commented 1 year ago

Maybe you can start your own branch and cherry-pick the commits?

Edit: Added you to my fork, @fstp, not sure that is of any help. Note that my fork is based on https://github.com/ravern/gollum, but it's rebased to https://github.com/elixir-crawly/gollum/.

fstp commented 1 year ago

Thanks! I will try to work directly on your fork, seems most convenient since we can just merge it all together then :)

oltarasenko commented 6 months ago

We have updated the Gollum a bit. Probably bugs related to it's implementation can go to it's repo.