digitalmethodsinitiative / 4cat

The 4CAT Capture and Analysis Toolkit provides modular data capture & analysis for a variety of social media platforms.

4chan, 8chan, and 8kun not currently supported in Docker installation #154

Closed: agrizzli closed this issue 3 years ago

agrizzli commented 3 years ago

I have tried to create a scoped search using the fourcat datasource to scrape posts from the /pol/ board. However, the job does not proceed. The database entry for status in the jobs table says:

["None", "Crash during execution"]

My DATASOURCES config is as follows.

"4chan": { "interval": 60, "boards": ["pol"], "autoscrape": False }, During docker-compose up no error corresponding messages are shown. Is there any way to access the corresponding error messages, which seems to happen in the worker?

P.S.: Although autoscrape is set to False, after starting, the tool begins scraping current posts from 4chan without considering the filter criteria from the scoped search. Any suggestions on that?
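For context, that snippet corresponds to one entry in the DATASOURCES setting of 4CAT's config.py. A minimal sketch of how it might sit there, with the caveat that the surrounding entries and any keys beyond those quoted above are illustrative and may differ between 4CAT versions:

```python
# config.py (sketch): the 4chan entry quoted in this issue, placed in the
# DATASOURCES dictionary. Other entries and key meanings are assumptions.
DATASOURCES = {
    "4chan": {
        "interval": 60,       # scrape interval, value as quoted above
        "boards": ["pol"],    # boards to monitor
        "autoscrape": False,  # do not start scraping automatically
    },
    # other data sources would be listed here
}
```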

dale-wahl commented 3 years ago

Hey agrizzli. Good catch. Unfortunately, we have not yet added Sphinx to the Docker setup, and it is required for 4chan, 8chan, and 8kun. If you need those data sources, for the moment you will need to install 4CAT directly and then follow these instructions to set up Sphinx.

I will update the Readme and Wiki to make it clearer that this is the case. Sorry about that.

agrizzli commented 3 years ago

Thank you, dale-wahl! I can create queries now, but they finish very quickly with "Query finished, but no results were found." Possibly I have some misunderstanding of how this tool works. Can I only search data that has been scraped in advance? Does this mean that 4cat won't search 4chan directly for keywords? This leads me to the assumption that 4cat cannot scrape historical data, e.g. from January 2021. Is this true?

dale-wahl commented 3 years ago

You should be able to scrape historical data. When creating a dataset, there is a date range option that should be set (I'm admittedly not 100% sure what it does if you don't set a date range, so I'm checking at the moment). Can you take a look at the logs (4cat.log and 4cat.stderr) and let us know if there are any errors there?
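One quick way to surface errors from those two files, assuming they live in a logs/ directory of the installation (a sketch, not part of 4CAT itself; adjust the path to wherever your setup writes them):

```python
# Sketch: print WARNING/ERROR lines from the 4CAT log files.
# The logs/ location is an assumption and may differ per installation.
from pathlib import Path

LOG_DIR = Path("logs")

for name in ("4cat.log", "4cat.stderr"):
    path = LOG_DIR / name
    if not path.exists():
        print(f"{name}: not found in {LOG_DIR.resolve()}")
        continue
    for line in path.read_text(errors="replace").splitlines():
        if "ERROR" in line or "WARNING" in line:
            print(f"{name}: {line}")
```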

agrizzli commented 3 years ago

After creating a simple dataset (4chan/pol, post contains "black", from 1 to 31 January 2021), there are no error messages in 4cat.stderr. In 4cat.log, only the following appears:

INFO (processor.py:890): Running processor 4chan-search on dataset e621b865802bb41520dd027a4b15d6d0

INFO (search.py:890): Querying: {'board': 'pol', 'body_match': 'black', 'subject_match': '', 'country_code': 'all', 'search_scope': 'posts-only', 'min_date': 1609455600, 'max_date': 1612047600, 'user': 'autologin', 'datasource': '4chan', 'type': '4chan-search', 'pseudonymise': True}

INFO (search_4chan.py:890): Running Sphinx query SELECT thread_id, post_id FROM '4chan_posts' WHERE timestamp >= 1609455600 AND timestamp < 1612047600 AND board = 'pol' AND MATCH('@body black') LIMIT 5000000 OPTION max_matches = 5000000, ranker = none, boolean_simplify = 1, sort_method = kbuffer, cutoff = 5000000 

INFO (search_4chan.py:890): Sphinx query finished in 0 seconds, 0 results.
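As a sanity check, the min_date and max_date values in the logged query do match the requested window; converting them back to dates (a quick check, assuming the UTC+1 timezone of the web UI selection):

```python
# Convert the logged Unix timestamps back to dates. Both fall on midnight
# in UTC+1, matching the 1-31 January 2021 range selected when creating
# the dataset.
from datetime import datetime, timezone, timedelta

cet = timezone(timedelta(hours=1))
for ts in (1609455600, 1612047600):
    print(ts, datetime.fromtimestamp(ts, tz=cet).isoformat())
# 1609455600 2021-01-01T00:00:00+01:00
# 1612047600 2021-01-31T00:00:00+01:00
```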

To me it looks like 4cat only queries Sphinx's searchd, and no other external service, to gather data. I had hoped that you provided access to historical data in some way, since 4chan itself does not seem to keep an archive of historical posts. So, as I understand it now, historical 4chan posts cannot be scraped at all, but only loaded from some source that scraped them at the right time.
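Since the search runs against the local Sphinx index, an empty index would explain the 0 results. This can be checked directly, because searchd speaks the MySQL protocol; a sketch using pymysql, where the host and port are the usual Sphinx defaults and the single-quoted index name simply mirrors the logged query:

```python
# Sketch: talk SphinxQL to the local searchd (default port 9306) and see
# whether the 4chan_posts index contains anything yet. Host, port and the
# index-name quoting are assumptions based on defaults and the log above.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=9306, user="")
try:
    with conn.cursor() as cursor:
        cursor.execute("SHOW TABLES")  # lists the indexes searchd serves
        print(cursor.fetchall())
        cursor.execute("SELECT * FROM '4chan_posts' LIMIT 1")
        print(cursor.fetchall() or "index is empty")
finally:
    conn.close()
```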

sal-uva commented 3 years ago

Hi @agrizzli, if you set up your own 4CAT instance with a 4chan data source, it should start collecting data from that point onward (note that you do have to index the posts for fast text search first).

If you want historical data, you'll have to import archives. 4plebs has some recent archives that you can import with this script.

Let me know if this answers your question!

agrizzli commented 3 years ago

Great! Thank you very much!