Closed by stijn-uva 7 months ago
Per our convo, I think this is definitely an improvement: anything that allows us to inform the user when we've modified data or made an assumption is very welcome. Drawing attention to the fact that "Export as CSV", and any processor running on JSON data, by definition works on data that has been modified in some way via `map_item` is a good step. I recently added tooltips for the "Preview" and "Export as CSV" buttons with this in mind (https://github.com/digitalmethodsinitiative/4cat/commit/c0aa4c75a40c0f1316d1440cf61039c7371803ec).
I could not immediately see how to check whether a given field has been modified, which I would like to do in certain processors. I dreamed up this:
```python
class FourcatMissingData(str):
    pass

item = FourcatMissingData()
```
`item` will now act as a string (by default `""`): you can write it to a CSV, concatenate it, or apply any other `str` operation. But `type(item) is str` is now `False`, and we can check `isinstance(thing, FourcatMissingData)` to verify whether we are dealing with missing data. Then, in `map_item` methods, we can actually verify whether fields exist in the data: `item.get('category', FourcatMissingData())` will differentiate between `item = {"category": ""}` (category is `""`) and `item = {}` (we do not know what the category is).
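For illustration, a self-contained sketch of how that sentinel would behave (using the class from the snippet above; the variable names are just for demonstration):

```python
class FourcatMissingData(str):
    """String subclass used as a sentinel for missing fields."""
    pass

# A missing field behaves like an empty string...
missing = FourcatMissingData()
assert missing == ""                            # writes cleanly to CSV
assert "category: " + missing == "category: "   # concatenates like any str

# ...but can still be told apart from a real empty string.
assert isinstance(missing, FourcatMissingData)
assert not isinstance("", FourcatMissingData)
assert type(missing) is not str                 # a subclass, not a plain str

# In map_item, dict.get with the sentinel distinguishes the two cases:
present = {"category": ""}.get("category", FourcatMissingData())
absent = {}.get("category", FourcatMissingData())
assert not isinstance(present, FourcatMissingData)  # field exists, value is ""
assert isinstance(absent, FourcatMissingData)       # field genuinely missing
```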
Ultimately, I would not inherit from string (i.e., `class FourcatMissingData: pass`) and would instead force development with missing data in mind (e.g., processors should handle missing data and then convert it to `""` if desired, or ignore the row, or do whatever the processor in question ought to do with missing data, and inform the user at that point).
```python
# We could even add a default value to use in more general processors
# (e.g., instead of always "" when writing to CSV)
class FourcatMissingData:
    def __init__(self, default=None):
        self.default = default
```
But if we want to assign missing data to `""` as a convention, a class inheriting `str` is sort of the best of both worlds.
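A hypothetical way to combine the two sketches, so the sentinel behaves like a string but still carries a configurable default (not actual 4CAT code):

```python
class FourcatMissingData(str):
    """Missing-data sentinel that still behaves like a string.

    The string value doubles as the default representation, so writing the
    sentinel to a CSV simply emits the default.
    """
    def __new__(cls, default=""):
        obj = super().__new__(cls, default)
        obj.default = default
        return obj

# Behaves as "" by default, but the default can be overridden per field:
assert FourcatMissingData() == ""
assert FourcatMissingData("N/A") == "N/A"
assert FourcatMissingData("N/A").default == "N/A"
assert isinstance(FourcatMissingData(), str)
```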
What is the range of options we have when we encounter an item with a missing or incomplete field? I can't think of other options, except perhaps trying to parse the underlying data in a different way to find the relevant value, but that should probably be done in `map_item()` itself.
If these are the only two (or, generally, if the range of options is somewhat limited), I think we could do this in `iterate_items`, along the lines of:

```python
for item in self.source_dataset.iterate_items(required=["bookmark_count"], strategy="ignore"):
    ...
```

Or something along those lines. This way we don't have to re-implement error handling for each processor, which I'm afraid would be somewhat error-prone, or easy to 'leave for later' when under time pressure. It wouldn't be able to automatically raise warnings this way, though, but we could refactor it into `self.dataset.iterate_source_items` to give it a direct reference to the dataset to call `update_status()` on, for example.
This could still use a `MappedItem` or `FourcatMissingData(scalar_type)` under the hood, but the processor code wouldn't have to make the distinction between those and 'actual' items; it could leave that to the backend.
If we need to handle missing data via one (or many?) `iterate_items` methods, then yes, I think we could do something like that. I'm envisioning pseudocode like:
```python
def iterate_items(strategy=None):
    for item in items:
        if strategy == "coerce":
            for key, value in item.items():
                if type(value) is FourcatMissingData:
                    item[key] = value.default  # this being set in `map_item`
        elif strategy == "ignore":
            if any(type(value) is FourcatMissingData for value in item.values()):
                continue
        yield item
```
So you could either skip or coerce values, or else return the `FourcatMissingData` object to be handled by the processor. This could give us time to properly handle missing data in the processors, which I believe is key.

The `required` argument is to only check certain fields and thus speed things up?
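For what it's worth, a runnable version of that pseudocode, with a hypothetical `required` parameter that limits which fields are checked (just a sketch, not actual 4CAT code):

```python
class FourcatMissingData(str):
    """Sentinel for a missing field; the string value doubles as its default."""
    def __new__(cls, default=""):
        obj = super().__new__(cls, default)
        obj.default = default
        return obj

def iterate_items(items, strategy=None, required=None):
    """Yield items, applying the chosen strategy to FourcatMissingData fields.

    `required` limits the check to the given field names; None checks all.
    """
    for item in items:
        fields = required if required is not None else list(item)
        missing = [f for f in fields if isinstance(item.get(f), FourcatMissingData)]
        if strategy == "coerce":
            for field in missing:
                item[field] = item[field].default  # default set in map_item
        elif strategy == "ignore" and missing:
            continue  # skip items with missing required fields
        yield item

items = [
    {"id": 1, "bookmark_count": 5},
    {"id": 2, "bookmark_count": FourcatMissingData("0")},
]
kept = list(iterate_items(items, strategy="ignore", required=["bookmark_count"]))
assert [item["id"] for item in kept] == [1]
```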
To answer your other question and perhaps better explain my point...
The range of options for handling missing data depends on why the data is missing and the type of analysis being done. Here's a brief article that kind of overviews where I'm coming from. Because 4CAT uses standardized analyses, and deciding between missing-data strategies requires real thought, we likely should not implement them ourselves. As an example: if I want to estimate the impressions Elon's tweets generated in a dataset, and some of these tweets are missing the impressions count, then to fill those gaps (which are almost certainly not zero or easily filled in `map_item`) I could do a lot of things. I could use an overall average, or I might see a general increase over time and use a rolling average, or I might compare it to the number of followers at the time of the tweet, the number of retweets, etc., and use a regression model to estimate the missing impressions. Whatever choice I make would need to be explained in my methodology and would be outside the scope of 4CAT. BUT I would like to know that those data points are missing if I export my data to a CSV from 4CAT. 4CAT's answer would probably have to be to sum only the known values and present that as the absolute minimum, but not a good estimate (depending on the number of missing data points).
If I run `rank-attributes` on a TikTok dataset categorized by "location_created", I want to categorize blank values differently from missing values. It seems like we currently drop both from the analysis.
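To illustrate the distinction, a toy example using the sentinel idea from earlier in the thread (hypothetical, not how 4CAT currently works):

```python
from collections import Counter

class FourcatMissingData(str):
    """Sentinel for fields absent from the source data."""
    pass

items = [
    {"location_created": "US"},
    {"location_created": ""},                    # blank: present but empty
    {"location_created": FourcatMissingData()},  # missing: not in the source
]

values = [item["location_created"] for item in items]
# Check the sentinel FIRST, since it also compares equal to ""
counts = Counter(
    "missing" if isinstance(v, FourcatMissingData) else (v or "blank")
    for v in values
)
# "US", "blank" and "missing" are now separate categories instead of
# blank and missing being lumped together (or dropped).
assert dict(counts) == {"US": 1, "blank": 1, "missing": 1}
```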
I would say that our missing values are probably not random; their absence says something specific about the items missing them.
All that is to say, I think it depends heavily on the processor.
I looked at how `pandas` handles missing data, as I used it for most of my analyses when I got started. They have a whole strategy for dealing with missing data using different "None" types. Their NaN (not a number) evaluates differently depending on the calculation being done (e.g., a NaN value counts as `0` in a sum, but as `1` in a product). They have various functions to convert columns into types which can either `raise` errors, `coerce`, or `ignore`. But even ignoring (which is being deprecated) just returns the original data (e.g., `pandas.to_numeric("string", errors="ignore")` returns `"string"`, which we would then have to handle). I'm not proposing we do something like that for 4CAT, but I am using it to illustrate that I'm not crazy for thinking this much about the problem. You'll be happy to know that the pandas `to_csv` method has a parameter `na_rep` that replaces NA/NaN/etc. with, by default, `""`.
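The identity-element behaviour can be sketched in plain Python, without pandas (just to illustrate the point):

```python
def skipna_sum(values, missing=None):
    # Ignoring a missing value in a sum is the same as treating it as 0,
    # the identity element of addition.
    return sum(v for v in values if v is not missing)

def skipna_prod(values, missing=None):
    # For a product, the identity element is 1 instead.
    result = 1
    for v in values:
        if v is not missing:
            result *= v
    return result

values = [2, None, 3]
assert skipna_sum(values) == 5   # None contributes 0
assert skipna_prod(values) == 6  # None contributes 1
```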
`MappedItem` objects will now check, upon creation, whether any of the data dictionary's items are of the class `MissingMappedField`. If so, a list of these fields is stored in `MappedItem.missing` (or an empty list if no fields are like this).
```python
def map_item(item):
    # ...
    return MappedItem({
        "some_field": MissingMappedField("")
    })
```
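For clarity, a minimal sketch of what that check on creation could look like (an assumed implementation for illustration, not the actual 4CAT code):

```python
class MissingMappedField(str):
    """Marks a field that could not be mapped; the string value is its default."""
    pass

class MappedItem:
    def __init__(self, data, message=""):
        self.data = data
        self.message = message
        # Collect the names of all fields marked as missing, on creation
        self.missing = [
            field for field, value in data.items()
            if isinstance(value, MissingMappedField)
        ]

    def get_item_data(self):
        return self.data

    def get_message(self):
        return self.message

mapped = MappedItem({"some_field": MissingMappedField(""), "id": "123"})
assert mapped.missing == ["some_field"]
assert MappedItem({"id": "123"}).missing == []
```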
Added an argument `map_missing` to `iterate_mapped_items` and `iterate_mapped_objects`. It takes the following values:

* `default` (the default): use the value passed to the constructor of `MissingMappedField` (e.g. `""` in the example above)
* `abort`: raise the new exception `MappedItemIncompleteException` when a missing field is encountered
* a callable: call it to determine the value to use, passing it the `MissingMappedField()`, for example

This strategy can be chosen for all missing fields (by passing the above as the argument value) or per field, by passing a dictionary with an explicit strategy per field:
```python
self.iterate_mapped_items(processor, map_missing="default")
self.iterate_mapped_items(processor, map_missing={"num_likes": "abort"})
```
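A sketch of how per-field strategy resolution might work inside such an iterator (the resolution logic here is assumed for illustration, not the actual implementation):

```python
class MissingMappedField(str):
    """Marks a missing field; the string value is its default."""
    pass

class MappedItemIncompleteException(Exception):
    """Raised when a missing field is encountered under the 'abort' strategy."""
    pass

def resolve_missing(data, map_missing="default"):
    """Apply a map_missing strategy (global string or per-field dict) to one item."""
    resolved = {}
    for field, value in data.items():
        if not isinstance(value, MissingMappedField):
            resolved[field] = value
            continue
        # A per-field dict falls back to "default" for unlisted fields
        strategy = (map_missing.get(field, "default")
                    if isinstance(map_missing, dict) else map_missing)
        if strategy == "abort":
            raise MappedItemIncompleteException(f"'{field}' is missing")
        resolved[field] = str(value)  # "default": use the constructor value
    return resolved

item = {"num_likes": MissingMappedField("0"), "body": "hello"}
assert resolve_missing(item) == {"num_likes": "0", "body": "hello"}
```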
Note that this only does something if fields are explicitly marked as missing by the `map_item` method of the relevant processor. Currently, only the `twitterv2` data source/processor does this. In all other cases, missing data is handled in `map_item` itself. Whether it makes sense to do it there or to outsource it to `iterate_mapped_objects` depends on whether the right strategy can differ depending on the purpose of the data (e.g., when determining the average of a range of items you may want to ignore a missing value, but you want to include it as an empty string when counting the total number of items).
This changes all `map_item()` methods so they return a `MappedItem()` object instead of a dict. The dict can be retrieved via `MappedItem.get_item_data()`. A `message` argument can additionally be passed to the object's constructor, which is then available via `get_message()` (defaults to an empty string). Existing code that uses `map_item()` has been adjusted to deal with this change.

This allows us to attach a warning to a mapped item, and these warnings can then be aggregated by e.g. import workers or processors to let the user know their data could be parsed well enough to use, but not perfectly. This functionality is now used in two places:

* the `DatasetMerger` processor, which will log warnings to the dataset log and finish with a status that includes the number of warnings (see #408)
* the `Search` class, which will log warnings when importing a dataset (e.g. when uploading via Zeeschuimer)

Other processors can choose to use the new `iterate_mapped_objects()` method instead of `iterate_mapped_items()`. This yields the full `MappedItem` object so that any warnings can be processed additionally.

This does not allow a processor to know exactly why mapping raised a warning, except by parsing the passed warning message (which seems ill-advised). The class could be extended to store more information for mapped items, e.g. a list of field names that were missing, or the original un-mapped item. I've kept the class minimal for now to make the general idea clearer.
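A sketch of how a processor might aggregate those warnings, using a minimal stand-in for `MappedItem` (the aggregation logic is illustrative, not taken from 4CAT):

```python
class MappedItem:
    """Minimal stand-in for 4CAT's MappedItem (message defaults to "")."""
    def __init__(self, data, message=""):
        self.data = data
        self.message = message

    def get_item_data(self):
        return self.data

    def get_message(self):
        return self.message

def collect_warnings(mapped_objects):
    """Aggregate warnings the way an import worker or processor might,
    while still collecting the usable item data."""
    items, warnings = [], []
    for mapped in mapped_objects:
        if mapped.get_message():
            warnings.append(mapped.get_message())
        items.append(mapped.get_item_data())
    return items, warnings

objects = [
    MappedItem({"id": "1"}),
    MappedItem({"id": "2"}, message="field 'num_likes' could not be parsed"),
]
items, warnings = collect_warnings(objects)
assert len(items) == 2      # all items remain usable...
assert len(warnings) == 1   # ...and the user can be told about the warning
```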