BeastImran / libgenparser

An easy and advanced libgen.rs scraper. It can search for books, filter results, download books, and more. It also supports asyncio.

scimag/journals scraping #1

Type-IIx opened this issue 3 years ago

Type-IIx commented 3 years ago

Great project so far. I wonder if you could add a feature to scrape journals? I'd like to configure the parser for that.

BeastImran commented 3 years ago

> Great project so far. I wonder if you could add a feature to scrape journals? I'd like to configure the parser for that.

Sure, will do that and release an update soon.

BeastImran commented 3 years ago

@Type-IIx would you like to get involved in the project? This package would be even better if I knew what kind of functionality and structure you would love to use.

Type-IIx commented 3 years ago

> @Type-IIx would you like to get involved in the project? This package would be even better if I knew what kind of functionality and structure you would love to use.

Perhaps! For now, what I envision is scraping journals. Example (try this URL yourself for a generic example):

http://{{libgen_root}}/scimag/journals/32406 =>

<h1 class="header"><a href="http://libgen.rs/">Library Genesis</a>: <a href="http://{{libgen_root}}/scimag/">Scientific articles</a></h1>
<p style="margin:1em 0;text-align:center">
Current alias domains are @[{{libgen_root0, libgen_root1..}}, ...

This is a random journal/scimag: basically, I went to {{libgen_root}}/scimag/journals/ and 32406 corresponds to the first entry ([A][0]) from browsing into the scimag/journals/ tree.

Journal page [template]

The journal page gives:

Journal:       | {{Journal_Title}}
Publisher:     | {{Publisher}}
ISSN (print):  | {{ISSN}}
Website:       | http://{{journal_site}}
Description:   | {{description}}

All articles: | search | DOI list

YYYY | Volume 1 ... n
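
To make this concrete, here is a rough sketch of what scraping one of these journal pages could look like. This is purely illustrative, not libgenparser's actual API: it assumes the journal metadata sits in label/value table rows as in the template above, so the selector logic would need adjusting to the real HTML.

```python
# Illustrative sketch only, NOT libgenparser's actual API.
# Requires: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

LIBGEN_ROOT = "http://libgen.rs"  # substitute your preferred mirror


def scrape_journal_page(journal_id: int) -> dict:
    """Fetch {{libgen_root}}/scimag/journals/<id> and pull out its fields."""
    url = f"{LIBGEN_ROOT}/scimag/journals/{journal_id}"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    fields = {}
    for row in soup.select("table tr"):
        cells = [c.get_text(strip=True) for c in row.find_all("td")]
        # Assumed layout: first cell is a label like "Journal:",
        # second cell is the value.
        if len(cells) >= 2 and cells[0].endswith(":"):
            fields[cells[0].rstrip(":")] = cells[1]
    return fields


# e.g. {'Journal': ..., 'Publisher': ..., 'ISSN (print)': ...}
print(scrape_journal_page(32406))
```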

It would be great if you could provide methods such as libgen.journal (or scimag).download_year(), .search_issn(), and .search_title(), returning a hash/array of hashes/arrays, e.g. [2009, 2008, ...] => [1, 2, 3, 4], for this journal page's journals/scimags. Further, you should be able to download() by year(s), issue(s), and DOI(s). DOIs are particularly important for referencing here, and it would be great to be able to search by DOI (the DOI list for each journal is always the journal/scimag root followed by /doi; probably (untested) https?://{{libgen_root}}/scimag/\d{5}/doi).

So, perhaps a few different download and search methods are warranted.
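
Something like this, perhaps, for the method surface (every name here is hypothetical, just to illustrate the shape I have in mind; nothing below exists in libgenparser):

```python
# Hypothetical method surface, purely a sketch of the interface proposed
# above. Requires Python 3.9+ for builtin generics.
from dataclasses import dataclass, field


@dataclass
class Journal:
    journal_id: int
    title: str = ""
    issn: str = ""
    # Mapping of year -> volumes, e.g. {2009: [1, 2, 3, 4], 2008: [1, 2]}
    volumes_by_year: dict[int, list[int]] = field(default_factory=dict)

    def doi_list_url(self, libgen_root: str = "http://libgen.rs") -> str:
        # Follows the (untested) pattern above: the journal root
        # followed by /doi, i.e. https?://{{libgen_root}}/scimag/\d{5}/doi
        return f"{libgen_root}/scimag/{self.journal_id}/doi"


class ScimagClient:
    """Sketch of the search/download methods requested in this issue."""

    def search_title(self, title: str) -> list[Journal]: ...
    def search_issn(self, issn: str) -> list[Journal]: ...
    def search_doi(self, doi: str) -> Journal: ...
    def download_year(self, journal: Journal, year: int) -> None: ...
    def download_doi(self, doi: str) -> None: ...
```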

P.S. I am no Python wizard!

Type-IIx commented 3 years ago

I see that the API (JSON) does not support scimag yet! I went to the source (https://forum(dot)mhut(dot)org/viewtopic.php?f=17&t=6874) to ask about this.

I would suggest that you feel free to build methods that do not rely on their API; alternatively, you could contribute to their API.

A (clunky) method that apparently works: right now you can grab the SQL database dump from the "Download" section of the main site, http://{{libgen_root}}/scimag/, and search for articles (e.g., by DOI) in it.
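
For example, after restoring the dump into a local MySQL instance, you could query it directly. The database, table, and column names below ("libgen_scimag", "scimag", "DOI", "Title") are my assumptions about the dump's schema; verify with SHOW TABLES / DESCRIBE first.

```python
# Sketch of querying the restored scimag dump by DOI; schema names are
# assumptions about the dump, not verified.
# Requires: pip install pymysql
import pymysql

conn = pymysql.connect(
    host="localhost", user="root", password="", database="libgen_scimag"
)
try:
    with conn.cursor() as cur:
        # Parameterized query; pymysql uses %s placeholders.
        cur.execute(
            "SELECT Title, DOI FROM scimag WHERE DOI = %s",
            ("10.1000/example.doi",),  # hypothetical DOI
        )
        for title, doi in cur.fetchall():
            print(doi, "->", title)
finally:
    conn.close()
```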

BeastImran commented 3 years ago

I have already implemented journals; mags are on the way. The problem is that I have very constrained time to spend on this project, but I am spending some.

Their API is not very good, so I had to scrape the results to get better data.

Anyway, I hope you will love the next release. I will probably publish it in 10 to 15 days or so 😅 It will take time.

Thank you for showing interest 👍🏻