Arxiv and PsyArchiv PDF parsing as custom plugin(s)

We need separate Socializer-derived plugin(s) for Arxiv and PsyArchiv PDF parsing.

Arxiv - follow one of the options:
1.1. Use the Arxiv search API, which returns results in Atom feed format, see: https://arxiv.org/help/api#using, https://arxiv.org/help/api/user-manual, https://arxiv.org/help/api/user-manual#query_details and http://export.arxiv.org/api/query?search_query=all:agi
1.1.1. In the custom version, the query parameters "start" and "max_results" can be used to iterate over the full document collection, e.g. "search_query=anton kolonin&id_list=&start=0&max_results=10" (this can also be done as a hack in RSSer, translating URLs containing "arxiv.org" into API calls like "http://export.arxiv.org/api/query?search_query=agi&start=2&max_results=2")
1.1.2. In the custom version, extra fields of the feed can be used, see https://arxiv.org/help/api/user-manual#query_details
1.2. Implement custom crawling with a custom crawler plugin (like RSS) on the Aigents side, based on #5
1.3. Implement Aigents-side URL filtering logic per site/user/instance for A) URLs that should not be crawled and B) URLs that should not be used to create news items
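
Option 1.1.1 could be sketched roughly as below. This is only an illustration in Python, not the Aigents (Java) implementation: the helper names `build_query_url`, `page_urls` and `parse_entries` are hypothetical, and the caller is assumed to know (or cap) the total number of results to page through. Only the "search_query", "start" and "max_results" parameters and the Atom feed format come from the API links above.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

API_URL = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # standard Atom XML namespace


def build_query_url(search_query, start=0, max_results=10):
    """Build one API call for a single page of results."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    # Keep ':' unescaped so queries like "all:agi" stay readable.
    return API_URL + "?" + urlencode(params, safe=":")


def page_urls(search_query, total, page_size=10):
    """URLs covering results 0..total-1 (total is supplied by the caller)."""
    return [
        build_query_url(search_query, start, page_size)
        for start in range(0, total, page_size)
    ]


def parse_entries(atom_xml):
    """Extract id and whitespace-normalized title from each Atom entry."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall(ATOM + "entry"):
        entries.append({
            "id": entry.findtext(ATOM + "id", default="").strip(),
            "title": " ".join(entry.findtext(ATOM + "title", default="").split()),
        })
    return entries
```

Fetching each URL from `page_urls(...)` and passing the response body to `parse_entries` would give one batch of candidate news items per page; the fetch itself is left out so the sketch stays independent of any HTTP client.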
PsyArchiv:
2.1. TODO
Random Issues:
3.1. PDFs not read from site in agi channel
3.2. https://arxiv.org/list/cs.AI/recent
3.3. Enable scope=web as default?
3.4. Missed https://arxiv.org/list/cs.AI/recent for 'knowledge representation'

P.S.: Suggestions from Eyob:

For arxiv, we need to crawl only “abs” links (e.g. https://export.arxiv.org/abs/2005.05255). Some weird pages, such as format pages, are being crawled. I think the crawler needs to crawl smartly, in a site-specific way (although hard-coded for now, a future AGI-ish implementation would take care of this automatically :D). Only article-like pages should be shown to the user. Examples of bad content crawled in the current feed setup on the staging site (staging.xcceleran.do):
https://export.arxiv.org/list/cs.SY/pastweek?skip=65&show=25
https://export.arxiv.org/format/2005.04589
The above links have ‘list’ and ‘format’ in their URLs instead of ‘abs’.
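
A hard-coded, site-specific filter of this kind might look like the following sketch. It is written in Python for brevity rather than as Aigents code; the function name `is_arxiv_abs_url` and the host list are assumptions made for illustration, and the only rule taken from the suggestion above is "keep ‘abs’ links, drop ‘list’ and ‘format’ links".

```python
from urllib.parse import urlparse

# Hosts treated as arXiv (assumed list, extend as needed).
ARXIV_HOSTS = {"arxiv.org", "www.arxiv.org", "export.arxiv.org"}


def is_arxiv_abs_url(url):
    """Return True only for arXiv article pages, i.e. paths under /abs/."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    if host not in ARXIV_HOSTS:
        return False  # not an arXiv URL; other sites need their own rules
    return parts.path.startswith("/abs/")
```

Applied to the links mentioned above, the `/abs/2005.05255` URL passes while the `/list/...` and `/format/...` URLs are rejected. The same check could serve both cases in 1.3: skipping the crawl entirely, or crawling for links but not creating news items.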
Titles that come from arxiv are not properly crawled. E.g., the title from https://export.arxiv.org/abs/2005.05178 is “other ] title: reinforced rewards framework for text style transfer learning ( cs.lg ) [ 110 ] arxiv:2005.05178 ( cross-list from cs.ro ) [ pdf”. We need a mechanism to get this right, e.g. using the title tag of the page?
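
Taking the title from the page's title tag could be sketched as below, again in Python only as an illustration. The class and function names are hypothetical, and the step that strips a leading "[identifier]" prefix is an assumption about how the title tag is formatted on abs pages; it should be checked against real pages before relying on it.

```python
import re
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text of the first <title> element only."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> (e.g. inside SVG)

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the page title with whitespace normalized, or '' if absent."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.parts).split())
    # Assumption: drop a leading "[2005.05178] "-style identifier if present.
    return re.sub(r"^\[[^\]]+\]\s*", "", title)
```

Using the title tag avoids picking up surrounding listing text such as subject classes and "[ pdf" links, which is what the garbled title above appears to contain.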
