digitalmethodsinitiative / 4cat

The 4CAT Capture and Analysis Toolkit provides modular data capture & analysis for a variety of social media platforms.

Map items to objects instead of dicts #409

Closed stijn-uva closed 7 months ago

stijn-uva commented 8 months ago

This changes all map_item() methods so they return a MappedItem object instead of a dict. The dict can be retrieved via MappedItem.get_item_data(). A message argument can additionally be passed to the object's constructor, which is then available via get_message() (it defaults to an empty string). Existing code that uses map_item() has been adjusted to deal with this change.
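Conceptually the wrapper is minimal. A sketch of its interface, assuming nothing beyond the methods described above:

class MappedItem:
    """Wraps a mapped item's dict, optionally carrying a warning message."""

    def __init__(self, data, message=""):
        self.data = data
        self.message = message

    def get_item_data(self):
        # the plain dict that map_item() previously returned directly
        return self.data

    def get_message(self):
        # warning attached during mapping; empty string if none was given
        return self.message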

This allows us to attach a warning to a mapped item, and these warnings can then be aggregated by e.g. import workers or processors to let the user know their data could be parsed well enough to use, but not perfectly. This functionality is now used in two places:

Other processors can choose to use the new iterate_mapped_objects() method instead of iterate_mapped_items(). This yields the full MappedItem object so that any attached warnings can be processed as well.
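For example, a processor could consume it roughly like this (the call signature is an assumption based on the examples later in this thread):

for mapped_item in self.source_dataset.iterate_mapped_objects(self):
    item = mapped_item.get_item_data()
    warning = mapped_item.get_message()
    if warning:
        # surface mapping warnings to the user, e.g. via the dataset status
        self.dataset.update_status("Mapping warning: %s" % warning)
    # ... process item as before ...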

This does not allow a processor to know exactly why mapping raised a warning, except by parsing the passed warning message (which seems ill-advised). The class could be extended to allow storing more information for mapped items, e.g. a list of field names that were missing, or the original un-mapped item. I've kept the class minimal for now to make the general idea clearer.

dale-wahl commented 7 months ago

Per our convo, I think this is definitely an improvement: anything that allows us to inform the user when we've modified data or made an assumption is very welcome. It also brings attention to the fact that "export as csv", and any processor running on JSON data, by definition works on data modified in some way via map_item. I recently added tooltips for the "Preview" and "Export as CSV" buttons with this in mind (https://github.com/digitalmethodsinitiative/4cat/commit/c0aa4c75a40c0f1316d1440cf61039c7371803ec).

I could not immediately see how to check whether a given field was modified, which I would like to do in certain processors. I dreamed up this:

class FourcatMissingData(str):
    # a sentinel: behaves like a normal (empty) string, but is detectable via isinstance()
    pass

item = FourcatMissingData()

item will now act as a string (by default ""): you can write it to a CSV, concatenate it, or use whatever other str operations, but type(item) is str is now False and we can check isinstance(thing, FourcatMissingData) to verify whether we are dealing with missing data. Then, in map_item methods, we can actually verify whether fields exist in the data (i.e., item.get('category', FourcatMissingData()) will differentiate between item = {"category": ""} (category is "") and item = {} (we do not know what category is)).
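To illustrate with the class above (hypothetical usage):

item = {"category": ""}
present = item.get("category", FourcatMissingData())  # "", present but blank
missing = {}.get("category", FourcatMissingData())    # field absent entirely

isinstance(present, FourcatMissingData)  # False: real, if empty, data
isinstance(missing, FourcatMissingData)  # True: we do not know the category
missing + ".csv"                         # still behaves as a str: ".csv"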

Ultimately, I would not inherit str (i.e., class FourcatMissingData: pass) and would instead force development with missing data in mind (e.g., processors should handle missing data and then convert it to "" if desired, or ignore the row, or whatever the processor in question ought to do with missing data, and inform the user at that point).

# We could even add a default value to use in more general processors (e.g., instead of always "" when writing to CSV)
class FourcatMissingData:
    def __init__(self, default=None):
        self.default = default

But if we want to assign missing data to "" as a convention, a class inheriting str is sort of the best of both worlds.

stijn-uva commented 7 months ago

What is the range of options we have when we encounter an item with a missing/incomplete field?

  1. Ignore the whole item (it's incomplete so it shouldn't count for our analysis)
  2. Pretend the missing field has some default value and process as normal (e.g. an empty string for text fields, or a 0 for numbers, or 'unknown' for categories)

I can't think of other options, except perhaps trying to parse the underlying data in a different way to find the relevant value - but that should probably be done in map_item() itself.

If these are the only two (or generally if the range of options is somewhat limited), I think maybe we could do this in iterate_items, along the lines of:

for item in self.source_dataset.iterate_items(required=["bookmark_count"], strategy="ignore"):

Or something along those lines. This way we don't have to re-implement error handling for each processor, which I'm afraid will be somewhat error-prone, or easy to 'leave for later' when under time pressure. It wouldn't be able to automatically raise warnings this way though, but we could refactor it into self.dataset.iterate_source_items to give it a direct reference to the dataset to call update_status() on, for example.
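A rough sketch of what that could look like, written here as a processor helper (the names and the status message are assumptions):

def iterate_source_items(self, required=None, strategy="ignore"):
    """Iterate mapped items, applying a missing-field strategy in one place (sketch)."""
    skipped = 0
    for item in self.source_dataset.iterate_items(self):
        missing = [f for f in (required or []) if isinstance(item.get(f), FourcatMissingData)]
        if missing and strategy == "ignore":
            skipped += 1
            continue
        yield item

    if skipped:
        # the direct dataset reference lets the iterator warn the user itself
        self.dataset.update_status("Skipped %i item(s) with missing required fields" % skipped)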

This could still use a MappedItem or FourcatMissingData(scalar_type) under the hood, but the processor code wouldn't have to make the distinction between those and 'actual' items, it could leave that to the backend.

dale-wahl commented 7 months ago

If we need to handle missing data via one (or many?) iterate_items methods, then yes, I think we could do something like that. I'm envisioning pseudocode like:

def iterate_items(self, strategy=None):
    for item in self.items:
        if strategy == "coerce":
            # replace missing fields with their default (set in `map_item`)
            for key, value in item.items():
                if isinstance(value, FourcatMissingData):
                    item[key] = value.default

        elif strategy == "ignore":
            # skip items that have any missing field
            if any(isinstance(value, FourcatMissingData) for value in item.values()):
                continue

        yield item

So you could either skip or coerce values, or else return the FourcatMissingData object to be handled by the processor. This could give us time to properly handle missing data in the processors, which I believe is key.
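Processor code would then only have to pick a strategy (hypothetical usage of the sketch above):

# skip incomplete items entirely
for item in self.iterate_items(strategy="ignore"):
    ...

# or fill in the defaults chosen in map_item
for item in self.iterate_items(strategy="coerce"):
    ...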

The required argument is there to only check certain fields and thus speed things up?

To answer your other question and perhaps better explain my point...

The range of options for handling missing data is related to why the data is missing and the type of analysis being done. Here's a brief article that kind of overviews where I'm coming from. Because 4CAT uses standardized analyses and deciding between missing data strategies requires real thought, we likely should not implement them ourselves.

As an example, if I want to estimate the impressions Elon's tweets generated in a dataset and some of his tweets are missing the impressions count, I could do a lot of things to fill those gaps (which are almost certainly not zero, and not easily filled in map_item). I could use an overall average, or I might see a general increase over time and use a rolling average, or I might compare it to the number of followers at the time of the tweet, the number of retweets, etc. and use a regression model to estimate the missing impressions. Whatever choice I make would need to be explained in my methodology and would be outside the scope of 4CAT. BUT I would like to know that those data points are missing if I export my data to a CSV from 4CAT. 4CAT's answer would probably have to be to sum only the known values and present that as the absolute minimum, but not a good estimate (depending on the number of missing data points).

If I run rank-attributes on a TikTok dataset categorized by "location_created", I want to categorize blank values differently from missing values. It seems like we currently drop both from the analysis.

I would say that our missing values are probably not random and that says something specific about those items missing them. They are likely missing data either because:

All that is to say, I think it depends heavily on the processor.

I looked at how pandas handles missing data, as I used that for most of my analyses when I got started. They have a whole strategy for dealing with missing data using different "None" types. Their NaN (not a number) evaluates differently depending on the calculation being done (e.g., a NaN value counts as 0 for a sum, but as 1 for a product). They have various functions to convert columns into types which can either raise errors, coerce, or ignore. But even ignoring (which is being deprecated) just returns the original data (e.g., pandas.to_numeric("string", errors="ignore") returns "string", which we would then have to handle). I'm not proposing we do something like that for 4CAT, but I am using it to illustrate that I'm not crazy for thinking this much about the problem. You'll be happy to know that pandas' to_csv method has a parameter na_rep that replaces NA/NaN/etc. with, by default, "".
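For reference, the pandas behaviour described above looks like this (the column name is made up):

import pandas as pd

df = pd.DataFrame({"impressions": [100, None, 300]})

# NaN is skipped in aggregations: it effectively counts as 0 for a sum, 1 for a product
print(df["impressions"].sum())   # 400.0
print(df["impressions"].prod())  # 30000.0

# na_rep controls what missing values become in the exported CSV ("" by default)
df.to_csv("impressions.csv", na_rep="")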

stijn-uva commented 7 months ago

MappedItem objects will now check, upon creation, whether any of the data dictionary's values are instances of MissingMappedField. If so, a list of these fields is stored in MappedItem.missing (or an empty list if no fields are missing).

def map_item(item):
    # ...
    return MappedItem({
        "some_field": MissingMappedField("")
    })
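A sketch of the corresponding check in the constructor (inheriting str mirrors the FourcatMissingData idea above and is an assumption):

class MissingMappedField(str):
    """Marks a field whose value could not be mapped; its str value is the default."""


class MappedItem:
    def __init__(self, data, message=""):
        self.data = data
        self.message = message
        # fields explicitly marked as missing by map_item(); empty list otherwise
        self.missing = [field for field, value in data.items()
                        if isinstance(value, MissingMappedField)]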

Added an argument map_missing to iterate_mapped_items and iterate_mapped_objects. It takes the following values:

This strategy can be chosen for all missing fields (by passing the above as the argument value) or per field, by passing a dictionary with an explicit strategy per field:

self.iterate_mapped_items(processor, map_missing="default")
self.iterate_mapped_items(processor, map_missing={"num_likes": "abort"})
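Under the hood, resolving the strategy per field could look roughly like this (a sketch: the helper name and the use of ValueError are assumptions, and only the two strategies named above are shown):

def resolve_missing(mapped_item, map_missing="default"):
    """Apply a missing-field strategy to a MappedItem (sketch)."""
    item = mapped_item.get_item_data()
    for field in mapped_item.missing:
        # a dict maps field names to strategies; anything else applies globally
        if isinstance(map_missing, dict):
            strategy = map_missing.get(field, "default")
        else:
            strategy = map_missing

        if strategy == "abort":
            raise ValueError("Field %s is missing from item" % field)
        elif strategy == "default":
            # a MissingMappedField carries its default value as its str content
            item[field] = str(item[field])

    return item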

Note that this only does something if fields are explicitly marked as missing by the map_item method of the relevant data source or processor. Currently, only the twitterv2 data source does this. In all other cases, missing data is handled in map_item itself. Whether it makes sense to handle it there or to outsource it to iterate_mapped_objects depends on whether the right strategy can differ depending on the purpose of the data (e.g. when determining the average of a range of items you may want to ignore a missing value, but you want to include it as an empty string when counting the total number of items).