coding-blocks / content-downloader

Python package to download files on any topic in bulk.
MIT License

Added support for downloading files from a specific website. #7

Closed (rishabhnambiar closed 7 years ago)

rishabhnambiar commented 7 years ago

I have added a 'site:' operator to the search query, which restricts results to the given website. Fixes #6.

Example: ctdl python -w=github.com downloads PDFs only from github.com.

nikhilkumarsingh commented 7 years ago

Please modify the print statement at line 155 in ctdl.py.

nikhilkumarsingh commented 7 years ago

I think we need to customize ctdl for downloading files from github.com. For example, Google will return links like this: https://github.com/emilmont/Artificial-Intelligence-and-Machine-Learning/blob/master/ML/ex8/ex8.pdf

But this is an HTML page. The actual raw data is at this link: https://github.com/emilmont/Artificial-Intelligence-and-Machine-Learning/raw/master/ML/ex8/ex8.pdf

So you will need to replace /blob/ with /raw/ in the URLs returned by the search function if the website is set to github.com.
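A minimal sketch of the /blob/ to /raw/ rewrite described above. The helper name to_raw_github_url is hypothetical and is not taken from ctdl.py. It checks each URL's host rather than the requested website, which is an assumption about how the fix could be made to work for any search.

```python
from urllib.parse import urlparse


def to_raw_github_url(url):
    """Return the raw-file form of a GitHub 'blob' URL.

    Hypothetical helper, not part of ctdl. URLs on other hosts, or
    GitHub URLs without a '/blob/' segment, are returned unchanged.
    """
    parsed = urlparse(url)
    is_github = parsed.netloc.lower() in ("github.com", "www.github.com")
    if is_github and "/blob/" in parsed.path:
        # Replace only the first '/blob/' so that a directory that
        # happens to be named 'blob' deeper in the path is left alone.
        raw_path = parsed.path.replace("/blob/", "/raw/", 1)
        return parsed._replace(path=raw_path).geturl()
    return url
```

Applied to the first link above, this returns the second one. Links from other sites pass through untouched.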

rishabhnambiar commented 7 years ago

Oh! Nice catch, I will fix that issue now. I have also added the GUI changes @nikhilkumarsingh.

rishabhnambiar commented 7 years ago

@nikhilkumarsingh, I have added support for github.com. Each URL is checked for github.com and corrected in the validate_links() function after the links have been validated. Any other issues?

nikhilkumarsingh commented 7 years ago

You can make the link validation generic. There is no need to pass the website name, because GitHub results can appear in general searches as well.

rishabhnambiar commented 7 years ago

Haha, yes, sorry, I don't know how I missed that. It's done! @nikhilkumarsingh

nikhilkumarsingh commented 7 years ago

Line 100: gquery = "filetype:{0} {1} site:{2}".format(file_type, query, website) can be improved to: gquery = "filetype:{0} site:{1} {2}".format(file_type, website, query)

rishabhnambiar commented 7 years ago

@nikhilkumarsingh, is this correct? I checked for None and then built the query accordingly.
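A rough sketch of the query construction being discussed. It assumes the None check is on the website value. The function name build_gquery and the form of the query when no website is given are also assumptions, not code from ctdl.py. Only the site-filtered format string comes from the review comment above.

```python
def build_gquery(file_type, query, website=None):
    """Build the search query string, adding 'site:' only when needed.

    Hypothetical helper; the no-website format is an assumption.
    """
    if website is None:
        # No site filter requested: search the whole web.
        return "filetype:{0} {1}".format(file_type, query)
    # Operators first, then the free-text query, as suggested in review.
    return "filetype:{0} site:{1} {2}".format(file_type, website, query)
```

For example, build_gquery("pdf", "python", "github.com") yields "filetype:pdf site:github.com python".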

nikhilkumarsingh commented 7 years ago

Please make changes only to ctdl.py. Just revert your commits in gui.py. PR #8 will cover the GUI modifications since it also fixes some bugs.

rishabhnambiar commented 7 years ago

Yes okay, I've reverted the changes in gui.py.

nikhilkumarsingh commented 7 years ago

Show a "No files found" message if no valid links are obtained.

rishabhnambiar commented 7 years ago

I set the message, but it still shows "Download complete." from downloader.py. To make that go away, I'll have to import ctdl in downloader. Should I do it?

nikhilkumarsingh commented 7 years ago

Why use downloader.py if no valid links are found? You can simply show a message and call sys.exit().
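One way the early exit suggested here might look, as a sketch only. The function name exit_if_no_links is hypothetical, and where the check sits in ctdl.py is an assumption. The idea is to stop before the downloader is ever invoked, so its "Download complete." message never prints.

```python
import sys


def exit_if_no_links(links):
    """Stop the program with a message when there is nothing to download.

    Hypothetical helper; returns the links unchanged when the list is
    non-empty so the caller can pass them on to the downloader.
    """
    if not links:
        print("No files found")
        # Exit before the download step runs at all.
        sys.exit()
    return links
```

Calling sys.exit() with no argument raises SystemExit with a zero exit status, so the shell still sees a successful run.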

rishabhnambiar commented 7 years ago

Done!