New scraping - Githubissues

Tisila commented 5 years ago

So far I've managed to implement working recipe ID scraping for HTML downloads. It checks which files already exist in the working directory and downloads only the ones still missing.

The next step will be to create a Markdown file with an index based on the Cookidoo categories, to search and open the recipes. It would also be interesting to do the same based on user bookmarks.

Finally, I will create an HTML-to-Markdown parser and convert the Markdown to a PDF file.
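
The "only downloads what is missing" behaviour described above could be sketched roughly as follows. This is an illustration, not the project's actual code: the function name `ids_left_to_download` and the assumption that each recipe is saved as `<recipe_id>.html` are hypothetical.

```python
import os


def ids_left_to_download(recipe_ids, directory="."):
    """Return the recipe IDs that have no matching .html file in `directory`.

    Assumption (not confirmed by the thread): files are named <recipe_id>.html.
    """
    already_saved = {
        os.path.splitext(name)[0]
        for name in os.listdir(directory)
        if name.endswith(".html")
    }
    # Keep the original order so downloads resume where they stopped.
    return [rid for rid in recipe_ids if rid not in already_saved]
```

Calling it again after an interrupted run returns only the IDs that still need fetching.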

Tisila commented 5 years ago

I just realized it's only working for the Portuguese version; making changes...

Simsal commented 5 years ago

Hey @Tisila, is it only working for Portuguese because you hardcoded the .pt ending for the website, or are there other reasons as well?

Tisila commented 5 years ago

Hey there @Simsal, it's only working for Portuguese because I hardcoded the .pt in the initial link and in the recipeToFile function. I've already started this change; it's almost complete.
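
One way to avoid a hardcoded ending is to build the site root from a locale value. The sketch below is an assumption-laden illustration: the helper name `base_url` and the `https://` scheme are hypothetical, and it only assumes (as the thread suggests with .pt and cookidoo.it) that the locale is the domain ending.

```python
def base_url(locale):
    """Build the site root for a locale such as 'pt', 'it' or 'de'.

    Assumption: the locale is used as the top-level domain.
    """
    # Normalise user input so ' IT ' and 'it' give the same URL.
    return "https://cookidoo.{}".format(locale.strip().lower())
```

Every place that previously contained the literal ending would then call this helper instead.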

Tisila commented 5 years ago

@Simsal All the hardcodings around the locale have been resolved. Test it out and let me know if you have any issues. Happy scraping!

auino commented 5 years ago

If both of you wish, I can add you as project developers.

Tisila commented 5 years ago

@auino, you already added me as a collaborator. That's the same thing, right?

auino commented 5 years ago

Yes, I only saw it after writing the message (sorry). I've tried the new version, but it doesn't work on macOS, even after renaming the binary file referenced in the script to chromedriver.

Tisila commented 5 years ago

No problem ;-) Have you downloaded the correct chromedriver and do you have Chrome browser installed? Although I don't know what kind of error message you're getting, I did a quick search and found out that the webdriver may have the wrong permissions. StackO link
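
If the permissions theory applies, a quick check before starting the browser might look like the sketch below. The helper name `ensure_executable` is hypothetical, and this only covers the executable bit, not other causes such as a driver built for a different Chrome version.

```python
import os
import stat


def ensure_executable(path):
    """Add the executable bits to `path` if the current user cannot run it."""
    if not os.path.isfile(path):
        raise IOError("no such file: " + path)
    if not os.access(path, os.X_OK):
        mode = os.stat(path).st_mode
        # Equivalent in spirit to `chmod +x path`.
        os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return os.access(path, os.X_OK)
```

For example, `ensure_executable("./chromedriver")` could be called once at start-up.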

Simsal commented 5 years ago

Also getting the following error. Also on macOS :)

auino commented 5 years ago

Also getting the following error. Also on macOS :)

This can be easily solved by installing selenium:

sudo pip install selenium

auino commented 5 years ago

No problem ;-) Have you downloaded the correct chromedriver and do you have Chrome browser installed? Although I don't know what kind of error message you're getting, I did a quick search and found out that the webdriver may have the wrong permissions. StackO link

It works if the ./chromedriver path is considered (I've added it as an optional parameter, in a temporary script). Nevertheless, the cookidoo.it domain input is not working (returning a NameError: name 'it' is not defined error).

Tisila commented 5 years ago

I was not expecting that one... The "pt" or "it" locale is just appended to the baseURL; I can't see what I did wrong. I'll check it later today.

Simsal commented 5 years ago

It works if you enter it with single quotes -> 'de'

After logging in and pressing Enter, the script terminates with the following error

Tisila commented 5 years ago

I can see that there are some inconsistencies between Windows and macOS. I think I know how to solve it. Hold on!

Tisila commented 5 years ago

I tried two different terminals and it worked fine. I made some changes that might solve the issue, but it's not guaranteed. I don't have a Mac, but a more universal solution would be to use a Docker container; just a thought. I would happily create a Dockerfile to get this up and running.

auino commented 5 years ago

I've tried it and it works great. I believe that, to improve it, the need for single-quoted input should be removed. Also, the dockerization would be good. I'll now accept your pull request, then make minor changes for multi-platform support (by adding input parameters to the script).
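
Input parameters of the kind mentioned here could be sketched with `argparse` from the standard library. The flag names `--chromedriver` and `--locale` and their defaults are hypothetical, chosen only for illustration; the thread does not say which parameters were actually added.

```python
import argparse


def parse_args(argv=None):
    """Parse illustrative command-line options (names are assumptions)."""
    parser = argparse.ArgumentParser(description="Recipe download options (sketch).")
    parser.add_argument(
        "--chromedriver",
        default="./chromedriver",
        help="path to the chromedriver binary",
    )
    parser.add_argument(
        "--locale",
        default="pt",
        help="site locale, for example pt, it or de",
    )
    # argv=None makes argparse read sys.argv[1:], the normal CLI behaviour.
    return parser.parse_args(argv)
```

Values passed this way arrive as plain strings, so no quoting is needed on any Python version.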

auino commented 5 years ago

Found a way to remove single quotes, with raw_input() (see https://stackoverflow.com/questions/37404134/in-python-is-there-anyway-to-input-a-string-without-quotation-marks).
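
For context on why quotes were needed: in Python 2, `input()` evaluates the typed text as a Python expression (it behaves like `eval(raw_input())`), so a bare `it` is looked up as a variable name. The sketch below mimics that behaviour with `eval` so it can be run on Python 3; the function name `py2_style_input` is made up for the illustration.

```python
def py2_style_input(typed):
    """Mimic Python 2's input(): evaluate the typed text as an expression."""
    # An empty globals dict keeps the lookup independent of surrounding names.
    return eval(typed, {})


try:
    py2_style_input("it")          # bare word: treated as a variable name
    bare_word_failed = False
except NameError:
    bare_word_failed = True

quoted_value = py2_style_input("'de'")  # quoted: evaluates to the string 'de'
```

`raw_input()` skips the evaluation step and returns the typed characters unchanged, which is why it accepts `it` without quotes.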

Tisila commented 5 years ago

I understand what this does and agree with the solution. I just have one silly question: are you running this with Python v2 or v3?

auino commented 5 years ago

Python 2.7.10

Tisila commented 5 years ago

There it is! Although I specified it at the start of the file and in the first commit, I forgot to mention it here...

cookiscrap.py is Python 3

I'm sorry for the inconvenience.

auino commented 5 years ago

Well, the current version should work on v2 too.

Tisila commented 5 years ago

Ok, let's make the raw_input change and keep it v2, since it's a simple solution.

auino commented 5 years ago

We could also dynamically detect the Python version (see this post on StackOverflow) and use the input/raw_input function accordingly.
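
A minimal sketch of that version check, assuming nothing about the project's real code: `read_line` and `ask_locale` are hypothetical names, and the prompt text is invented.

```python
import sys

# Pick the function that returns the typed text unchanged on each version.
if sys.version_info[0] < 3:
    read_line = raw_input  # noqa: F821  (this name exists only on Python 2)
else:
    read_line = input


def ask_locale(reader=read_line):
    """Prompt for a locale and return it without surrounding whitespace."""
    return reader("Locale (e.g. pt, it, de): ").strip()
```

Passing the reader as a parameter also makes the prompt easy to exercise without a terminal.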

Tisila commented 5 years ago

You're right, that's the right way to get this working correctly. I thought that raw_input worked in v3. I've started making those changes in the new-parser branch. The input is working great!

Tisila commented 5 years ago

You can delete new-scraping if you wish. New developments will be made in new-parser.

auino / cookidump

New scraping #4