aigents / aigents-java

Aigents Java Core Platform
MIT License

Arxiv and PsyArchiv PDF parsing as a custom plugin(s) #19

Open akolonin opened 4 years ago

akolonin commented 4 years ago

Need separate Socializer-derived plugin(s) for Arxiv and PsyArchiv PDF parsing

  1. Arxiv - follow one of the options:
    1.1. Use the Arxiv search API, which returns results in Atom feed format, see: https://arxiv.org/help/api#using, https://arxiv.org/help/api/user-manual, https://arxiv.org/help/api/user-manual#query_details and http://export.arxiv.org/api/query?search_query=all:agi
      1.1.1. In a custom version, the query parameters "start" and "max_results" can be used to iterate over the full document collection, e.g. "search_query=anton kolonin&id_list=&start=0&max_results=10" (can also be done as a hack in RSSer, translating URLs containing "arxiv.org" into API calls like "http://export.arxiv.org/api/query?search_query=agi&start=2&max_results=2")
      1.1.2. In a custom version, extra fields of the feed can be used, see https://arxiv.org/help/api/user-manual#query_details
    1.2. Implement custom crawling with a custom crawler plugin (like RSS) on the Aigents side, based on #5
    1.3. Implement Aigents-side URL filtering logic per site/user/instance for A) URLs not crawled and B) URLs not used to create news items

  2. PsyArchiv:
    2.1. TODO

  3. Random Issues:
    3.1. PDFs not read from site in agi channel
    3.2. https://arxiv.org/list/cs.AI/recent
    3.3. Enable scope=web as default?
    3.4. Missed https://arxiv.org/list/cs.AI/recent for 'knowledge representation'
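
A rough sketch of option 1.1.1, for discussion only: building paged API URLs with "start" and "max_results", and pulling the title and PDF link out of each Atom entry. The class and method names (`ArxivFeedSketch`, `pageUrl`, `parseEntries`) are hypothetical and not part of Aigents. It also assumes the PDF link is the `<link>` element with `title="pdf"`, as described in the API user manual. No network calls are made here; fetching the feed and wiring this into a Socializer-derived plugin is left out.

```java
import java.io.ByteArrayInputStream;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not an existing Aigents class.
class ArxivFeedSketch {
    static final String API = "http://export.arxiv.org/api/query";
    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    // Builds one page of an API query; start is the zero-based offset.
    static String pageUrl(String query, int start, int maxResults) throws Exception {
        return API + "?search_query=" + URLEncoder.encode(query, "UTF-8")
                + "&start=" + start + "&max_results=" + maxResults;
    }

    // Returns {title, pdfUrl} pairs; pdfUrl is null when an entry has no PDF link.
    static List<String[]> parseEntries(String atomXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations, since the feed comes from the network.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(atomXml.getBytes(StandardCharsets.UTF_8)));

        List<String[]> result = new ArrayList<>();
        NodeList entries = doc.getElementsByTagNameNS(ATOM_NS, "entry");
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            NodeList titles = entry.getElementsByTagNameNS(ATOM_NS, "title");
            String title = titles.getLength() == 0 ? ""
                    : titles.item(0).getTextContent().trim().replaceAll("\\s+", " ");
            String pdf = null;
            NodeList links = entry.getElementsByTagNameNS(ATOM_NS, "link");
            for (int j = 0; j < links.getLength(); j++) {
                Element link = (Element) links.item(j);
                // Assumption: the PDF link is marked with title="pdf".
                if ("pdf".equals(link.getAttribute("title"))) {
                    pdf = link.getAttribute("href");
                }
            }
            result.add(new String[] { title, pdf });
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Print the URLs for the first three pages of ten results each.
        for (int start = 0; start < 30; start += 10) {
            System.out.println(pageUrl("agi", start, 10));
        }
    }
}
```

A real plugin would presumably keep requesting pages until a page comes back with no entries, and would hand each PDF URL to whatever PDF reader the platform uses.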

P.S.: Suggestions from Eyob: