Closed: agrizzli closed this issue 3 years ago
Hey agrizzli. Good catch. Unfortunately, we have not yet added Sphinx to the Docker setup, and it is required for the 4chan, 8chan, and 8kun data sources. If you need those sources, for the moment you will need to install 4CAT directly and then follow these instructions to set up Sphinx.
I will update the Readme and Wiki to make it clearer that this is the case. Sorry about that.
Thank you, dale-wahl! I can create queries now, but they finish very quickly with the message "Query finished, but no results were found."
Possibly I have misunderstood how this tool works. Can I only search data that has been scraped in advance? Does this mean that 4CAT won't search 4chan directly for keywords? This leads me to assume that 4CAT cannot scrape historical data, e.g. from January 2021. Is this true?
You should be able to scrape historical data. When creating a dataset, there is a date range option that should be set (I'm admittedly not 100% sure what it does if you don't set a date range so am checking at the moment). Can you take a look at the logs (4cat.log and 4cat.stderr) and let us know if there are any errors there?
After creating a simple dataset (4chan/pol, post contains "black", from 1st to 31st Jan 2021) there are no error messages in 4cat.stderr. In 4cat.log, only the following appears:
INFO (processor.py:890): Running processor 4chan-search on dataset e621b865802bb41520dd027a4b15d6d0
INFO (search.py:890): Querying: {'board': 'pol', 'body_match': 'black', 'subject_match': '', 'country_code': 'all', 'search_scope': 'posts-only', 'min_date': 1609455600, 'max_date': 1612047600, 'user': 'autologin', 'datasource': '4chan', 'type': '4chan-search', 'pseudonymise': True}
INFO (search_4chan.py:890): Running Sphinx query SELECT thread_id, post_id FROM '4chan_posts' WHERE timestamp >= 1609455600 AND timestamp < 1612047600 AND board = 'pol' AND MATCH('@body black') LIMIT 5000000 OPTION max_matches = 5000000, ranker = none, boolean_simplify = 1, sort_method = kbuffer, cutoff = 5000000
INFO (search_4chan.py:890): Sphinx query finished in 0 seconds, 0 results.
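For reference, the min_date and max_date values in the log above are Unix timestamps. A minimal sketch for decoding them is shown here; the helper name to_utc is only illustrative and is not part of 4CAT. Decoded in UTC, both bounds fall at 23:00, which would be consistent with midnight in a UTC+1 timezone, although that interpretation is an assumption. Since the Sphinx query uses timestamp < max_date, the range appears to end at the start of the last selected day rather than at its end.

```python
from datetime import datetime, timezone

# Values copied from the "Querying" log line above.
min_date = 1609455600
max_date = 1612047600


def to_utc(ts: int) -> str:
    """Format a Unix timestamp as an ISO 8601 string in UTC (illustrative helper)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()


print(to_utc(min_date))  # 2020-12-31T23:00:00+00:00
print(to_utc(max_date))  # 2021-01-30T23:00:00+00:00

# Length of the queried window in days.
print((max_date - min_date) / 86400)  # 30.0
```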
To me it looks like 4CAT only queries Sphinx's searchd and no other external service to gather data. I had hoped that you would provide access to historical data in some way, since 4chan itself does not seem to have an archive of historical data. As I understand it now, historical 4chan posts cannot be scraped at all, but only loaded from some source that scraped them at the time.
Hi @agrizzli, if you set up your own 4CAT instance with a 4chan data source, it should start collecting data from that point onward (note that you do have to index the posts for fast text search first).
If you want historical data, you'll have to import archives. 4plebs has some recent archives that you can import with this script.
Let me know if this answers your question!
Great! Thank you very much!
I have tried to create a scoped search using the fourcat datasource to scrape posts from the pol channel. However, the job does not proceed. The database entry for status in the table jobs says: ["None", "Crash during execution"]
My DATASOURCES config is as follows:
"4chan": { "interval": 60, "boards": ["pol"], "autoscrape": False },
During docker-compose up no corresponding error messages are shown. Is there any way to access the error messages for the crash, which seems to happen in the worker?
P.S.: Although autoscrape is set to False, after startup the tool begins scraping current posts from 4chan without considering the filter criteria from the scoped search. Any suggestions on that?