alirezamika / autoscraper

A Smart, Automatic, Fast and Lightweight Web Scraper for Python
MIT License
6.24k stars 654 forks

Github issue numbers #49

Closed: cory171185 closed this issue 3 years ago

cory171185 commented 3 years ago

Thanks for creating such a cool project! It looks like it's exactly what I need, but I'm having trouble getting it to work for GitHub issue numbers.

Example code using this project's own issues page:

from autoscraper import AutoScraper

url = 'https://github.com/alirezamika/autoscraper/issues?q=is%3Aissue'

wanted_list = ["#47"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)

The result here is empty, and when I check stack_list there is nothing there either. The element is formatted a bit strangely, with lots of whitespace and newlines, so I tried copying the whole element in directly with triple quotes, but that gives the same result.

wanted_list = ["""
          #47
            by """]

Which, when evaluated by Python, becomes:

['\n          #47\n            by ']

Originally I also tried just using the number, as that would be the most convenient, but no beans. I was able to get it to work easily with the actual text of the issue, so I fear it's something weird with the way it's formatted.

Is this an issue with whitespace, or am I messing up something basic? Thanks!

alirezamika commented 3 years ago

The scraper looks for a full-text match against each element, and there is no element on the page whose text is exactly #47. Whitespace is stripped from the start and end of each text, but not from the middle. You can use the text_fuzz_ratio argument to allow fuzzy text matching and get around this odd formatting. Something like this:

from autoscraper import AutoScraper

url = 'https://github.com/alirezamika/autoscraper/issues?q=is%3Aissue'

wanted_list = ["#47 opened 9 days ago by programmeddeath1"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list, text_fuzz_ratio=0.6)
print(result)

You can then fine-tune the rules with the remove_rules and keep_rules methods.
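To illustrate why the bare "#47" fails while the full line with a fuzz ratio works, here is a minimal standalone sketch of the matching behavior described above. It does not use AutoScraper itself: text_matches is a hypothetical helper, the element text is a made-up approximation of the issue row, and comparing with difflib.SequenceMatcher is an assumption about how a fuzz ratio could be applied, not a statement of the library's internals.

```python
from difflib import SequenceMatcher


def text_matches(wanted: str, element_text: str, fuzz_ratio: float = 1.0) -> bool:
    """Hypothetical helper: compare a wanted string with an element's text.

    Leading and trailing whitespace is stripped, but whitespace in the
    middle is kept, mirroring the behavior described in this thread.
    """
    wanted = wanted.strip()
    element_text = element_text.strip()
    if fuzz_ratio >= 1.0:
        # A ratio of 1.0 means the texts must be identical.
        return wanted == element_text
    # Otherwise accept any pair whose similarity reaches the ratio.
    return SequenceMatcher(None, wanted, element_text).ratio() >= fuzz_ratio


# Made-up element text: a newline and a run of spaces sit in the middle.
element_text = "\n          #47\n            opened 9 days ago by programmeddeath1\n"
full_line = "#47 opened 9 days ago by programmeddeath1"

print(text_matches(full_line, element_text))       # exact comparison
print(text_matches(full_line, element_text, 0.6))  # fuzzy comparison
print(text_matches("#47", element_text, 0.6))      # issue number alone
```

In this sketch the exact comparison fails because of the whitespace in the middle, the fuzzy comparison at 0.6 succeeds, and "#47" on its own is too short relative to the element's full text to reach the ratio. That is consistent with needing the whole line in wanted_list.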

cory171185 commented 3 years ago

Thanks! I was able to get it to work by using the whole line and messing with text_fuzz_ratio.