digitalmethodsinitiative / zeeschuimer

A browser extension to collect social media data with.

Unexpected Data Format during capturing Instagram Data #27

Closed BergJakob closed 9 months ago

BergJakob commented 9 months ago

Thanks for the new release; scrolling and capturing data from specific profiles works again. But when I transfer the data to 4CAT, the results cell shows the following message: "Unexpected data format for XX items. All data can be downloaded, but only data with expected format will be available to 4CAT processors; check logs for details. XX items captured." Looking at the details, it seems that the format of the items in the NDJSON file can't be recognized.

Is this a local problem on my machine or a general bug?

Thanks in advance!

BergJakob commented 9 months ago

Here are the log details:

Fri Feb 23 07:16:46 2024: Processing 'Import scraped Instagram data' started for dataset 2f52d64e80268333691f61f53f07f8df
Fri Feb 23 07:16:46 2024: Processing data
Fri Feb 23 07:16:48 2024: Writing collected data to dataset file
Fri Feb 23 07:16:48 2024: MapItemException (item 0): Unable to map item: KeyError-'full_name'
Fri Feb 23 07:16:48 2024: MapItemException (item 1): Unable to map item: KeyError-'full_name'
Fri Feb 23 07:16:48 2024: MapItemException (item 2): Unable to map item: KeyError-'full_name'
Fri Feb 23 07:16:48 2024: MapItemException (item 3): Unable to map item: KeyError-'full_name'
Fri Feb 23 07:16:48 2024: MapItemException (item 4): Unable to map item: KeyError-'full_name'
Fri Feb 23 07:16:48 2024: MapItemException (item 5): Unable to map item: KeyError-'full_name'
Fri Feb 23 07:16:48 2024: MapItemException (item 6): Unable to map item: KeyError-'full_name'
Fri Feb 23 07:16:48 2024: MapItemException (item 7): Unable to map item: KeyError-'full_name'
Fri Feb 23 07:16:48 2024: MapItemException (item 8): Unable to map item: KeyError-'full_name'
Fri Feb 23 07:16:48 2024: MapItemException (item 9): Unable to map item: KeyError-'full_name'
Fri Feb 23 07:16:48 2024: MapItemException (item 10): Unable to map item: KeyError-'full_name'
Fri Feb 23 07:16:48 2024: MapItemException (item 11): Unable to map item: KeyError-'full_name'
Fri Feb 23 07:16:48 2024: MapItemException (item 12): Unable to map item: KeyError-'full_name'
Fri Feb 23 07:16:48 2024: MapItemException (item 13): Unable to map item: KeyError-'full_name'
Fri Feb 23 07:16:48 2024: Query finished, results are available.
Fri Feb 23 07:16:48 2024: Unexpected data format for 14 items. All data can be downloaded, but only data with expected format will be available to 4CAT processors; check logs for details
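
Every failing item in this log reports the same missing key, which is easier to see once the entries are tallied. The following is a minimal Python sketch for summarizing such a log. The regular expression is inferred only from the lines quoted in this thread, and `summarize_map_errors` is a hypothetical helper name, not part of 4CAT or Zeeschuimer.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this thread, e.g.
# "MapItemException (item 0): Unable to map item: KeyError-'full_name'".
# Other 4CAT versions may format these messages differently (assumption).
MAP_ERROR = re.compile(
    r"MapItemException \(item (\d+)\): Unable to map item: (\w+)-'([^']*)'"
)


def summarize_map_errors(log_text: str) -> Counter:
    """Count mapping failures per (exception type, detail) pair."""
    counts = Counter()
    for match in MAP_ERROR.finditer(log_text):
        _item_index, exc_type, detail = match.groups()
        counts[(exc_type, detail)] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Fri Feb 23 07:16:48 2024: MapItemException (item 0): "
        "Unable to map item: KeyError-'full_name'\n"
        "Fri Feb 23 07:16:48 2024: MapItemException (item 1): "
        "Unable to map item: KeyError-'full_name'\n"
    )
    for (exc_type, detail), count in summarize_map_errors(sample).items():
        print(f"{exc_type} {detail!r}: {count} item(s)")
```

If all failures share one exception type and key, as they do here, that points to a single structural change in the data rather than a few malformed items.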

stijn-uva commented 9 months ago

Hi @BergJakob, thanks for the report. Would you be able to send the NDJSON export file to me so I can take a closer look? A fix would depend on where the data is coming from exactly so it would help to be able to look at the data. You can post it here as an attachment or e-mail me at [e-mail] if you'd rather not post it publicly.

BergJakob commented 9 months ago

I'll contact you via e-mail in a few minutes. Thank you for the quick response.

stijn-uva commented 9 months ago

Thanks for forwarding the data! The issue is fixed in 4CAT; Instagram changed their data structure recently and 4CAT needs to be updated accordingly. We will be releasing a new version early next week, so if you then update 4CAT you should be able to properly upload the data and export it to CSV, et cetera.
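
For readers who want to check their own export before updating, here is a minimal Python sketch that lists which items in an NDJSON file lack a given key such as `full_name`. It assumes only that each non-empty line of the file is one JSON document; it does not assume any particular Zeeschuimer item layout, so it searches for the key at any nesting depth. The helper names are hypothetical and this is not how 4CAT itself maps items.

```python
import json
from typing import Any, Iterable


def contains_key(value: Any, key: str) -> bool:
    """Return True if `key` occurs at any depth of a decoded JSON value."""
    if isinstance(value, dict):
        return key in value or any(contains_key(v, key) for v in value.values())
    if isinstance(value, list):
        return any(contains_key(v, key) for v in value)
    return False


def items_missing_key(lines: Iterable[str], key: str) -> list[int]:
    """Return zero-based indexes of NDJSON items in which `key` never occurs.

    Blank lines are skipped and do not count as items.
    """
    missing = []
    index = 0
    for line in lines:
        if not line.strip():
            continue
        if not contains_key(json.loads(line), key):
            missing.append(index)
        index += 1
    return missing
```

It could be used along the lines of `with open("export.ndjson", encoding="utf-8") as f: print(items_missing_key(f, "full_name"))`, where `export.ndjson` is a placeholder file name. Note that a key being present somewhere in an item does not guarantee it sits where the importer expects it, so an empty result does not by itself rule out a mapping error.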

BergJakob commented 9 months ago

Thank you!

BergJakob commented 8 months ago

Hi @stijn-uva, is it already foreseeable when the new 4CAT version will be published? I need to scrape data for a paper, so I have to plan my timeline around when I can use 4CAT again. Thanks so much in advance!

dale-wahl commented 8 months ago

If you cannot wait for the release, you can use the "latest" Docker tag in the .env file to run the most current version of 4CAT, which includes the above fix.

stijn-uva commented 8 months ago

Hi @BergJakob, this was delayed a bit for various reasons, but version 1.40 is now available: https://github.com/digitalmethodsinitiative/4cat/releases/tag/v1.40