fandangOrg / fandango

FAke News discovery and propagation from big Data ANalysis and artificial intelliGence Operations

More than 10,000 claims with the claim field empty - Also problems with 'author.name' and 'author.sameAs' #116

Closed vittorianovancini closed 3 years ago

vittorianovancini commented 3 years ago

Fandango has now more than 17 thousand ingested claims, but if you sort them in the Dashboard, more than 10,000 have the claim text field empty

There are also evident problems with the author.name and author.sameAs fields. In this example, Matteo Renzo is indicated as the author of the fact-check, and there is a video in the other field...

(screenshot attached)

pstalidis commented 3 years ago

Claims and claim reviews are ingested by crawling the list of fact checkers. For every link found inside these sources, an attempt is made to extract claims and reviews that follow the claimReview schema, and whatever is extracted is stored in the database. We do not verify the validity of what each data source puts inside the schema fields: if a source leaves a field empty (like "claimReviewed", which is where the "text" of the claim entity is filled in), the resulting database entry will be empty.
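The extraction described above can be sketched as follows. This is a minimal illustration, not the actual Fandango crawler: the JSON-LD snippet, field names, and the flat output layout are assumptions, chosen to show how an empty "claimReviewed" passes straight through into an empty claim text.

```python
import json

# Hypothetical JSON-LD snippet as a fact-checker page might embed it.
# Note the empty "claimReviewed" -- syntactically valid per the schema.
page_jsonld = """
{
  "@context": "https://schema.org",
  "@type": "ClaimReview",
  "claimReviewed": "",
  "author": {"@type": "Organization", "name": "Pagella Politica"},
  "itemReviewed": {
    "@type": "Claim",
    "author": {"@type": "Person", "name": "Vittorio Grilli"}
  }
}
"""

def extract_claim(jsonld_text):
    """Map a ClaimReview JSON-LD object to a flat database entry.

    No validation is performed: whatever the source put in
    "claimReviewed" (even an empty string) becomes the claim text.
    """
    data = json.loads(jsonld_text)
    if data.get("@type") != "ClaimReview":
        return None
    claim = data.get("itemReviewed", {})
    return {
        "claim.text": data.get("claimReviewed", ""),
        "author.name": claim.get("author", {}).get("name", ""),
        "reviewer": data.get("author", {}).get("name", ""),
    }

entry = extract_claim(page_jsonld)
print(entry["claim.text"] == "")  # True: the empty text is stored as-is
```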

Do you want me to delete all claims that have an empty "text"?

vittorianovancini commented 3 years ago

I think this might be the right solution, because I suspect this has a very bad influence on the system (what do you think, Neal?). But I'm discussing it with the other end user, since this means losing two thirds of the claims, and we'll let you know tomorrow. Meanwhile, thanks.

pstalidis commented 3 years ago

I am adding an example screenshot of the provided data from pagellapolitica.it for your consideration.

As you can see, the "claimReviewed" field is an empty string, so the "claim.text" is empty.

Also, you can see that Vittorio Grilli is indicated as the author and there is a YouTube video in the other field...

mmagaldi-eng commented 3 years ago

I think that the only solution here is that end-users check and clean all claims. @vittorianovancini, my suggestion for end-users is to start checking the earliest and the latest claim of each site, just to have an idea of the claim review data quality.

vittorianovancini commented 3 years ago

Nice screenshot. And when were these missing parts discovered, exactly? Because this is a major bug, not an enhancement of the system, when two thirds of the records are missing the most important data: the claim. Ten thousand out of 17,000. Data consistency control comes before the data quality check, or at least this is what we do at ANSA. Besides, we end users were told several times in several plenaries that the claims ingestion was working fine, actually better than the others. And now we'll have to do this check manually, one by one, to prevent incomplete data from hindering the proper functioning of the Claims Tool. OK, we'll do it, but this is not right at all.

mmagaldi-eng commented 3 years ago

We have no time now.

One alternative solution is to investigate whether we can re-ingest the raw data (if we have it stored somewhere) and remap the fields in a different way to get more usable claims.

In the meantime, my suggestion for the end users is to focus on specific use cases, check the related claims, and fix them manually if needed.

pstalidis commented 3 years ago

> And when were these missing parts discovered, exactly?

Two years ago. We discussed then that each fact checker presents its information differently, but Google has been pushing everyone to follow the claimReview schema, so we agreed we would only gather claims and claim reviews that follow this specific schema.

> Because this is a major bug, not an enhancement of the system

A bug implies that the code does something different than intended. This code is designed to extract values from specific fields of a web page and put them in specific fields of the database. It does exactly that, as intended, so this is not a bug. If we want different behaviour (changing the algorithm to achieve something else), such as not inserting an item into the database when a field is empty, then we mark it as an enhancement of the code.

> Data consistency control comes before the data quality check, or at least this is what we do at ANSA.

I can only speak for the consistency of our code, and our code is consistent: if some type of information exists where it is supposed to be, it is read correctly and put in the correct place, every time. If a website is not consistent with the protocol it claims to follow, there is nothing more we can do. They should fix their code.

> Besides, we end users were told several times in several plenaries that the claims ingestion was working fine, actually better than the others.

And it is working fine, bringing all of these claims into the database.

Now, if you want me to add a filter and discard claims with empty text, I can do that. I can also change the mapping and use information that exists in a different field.
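The two options mentioned here (a filter on empty text, and a fallback mapping) could look roughly like this. Both field names, including "review.headline", are illustrative assumptions rather than the actual Fandango schema:

```python
def should_ingest(entry):
    """Proposed enhancement: ingest a claim only if its text field is
    non-empty after stripping whitespace."""
    return bool(entry.get("claim.text", "").strip())

def remap_text(raw):
    """Alternative-mapping sketch: fall back to another extracted field
    (here a hypothetical "review.headline") when "claimReviewed" is empty."""
    return raw.get("claimReviewed", "").strip() or raw.get("review.headline", "")

print(should_ingest({"claim.text": "GDP grew 2% in 2019"}))  # True
print(should_ingest({"claim.text": "   "}))                  # False
```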

What I cannot do is find information that does not exist.

Please let me know.

vittorianovancini commented 3 years ago

Having the filter discard the empty claims would be the best solution, thank you. But then we will have to load other claims, maybe with the same solution we experimented with for the 450 claims you ingested last month. Later today I will let you know what we have decided, after I hear from the other end user, but I think they will confirm the deletion.

pstalidis commented 3 years ago

I removed all claims (and their reviews) that have an empty text. There are now 6.5k claims. The claims with the empty text remain in my own database, should we need them again.

I have restarted the claim crawler to gather more claims (without the filter, so claims with empty text might appear again, but I will delete them when necessary).
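The delete-but-archive step described above (remove empty-text claims from the live store, keep them aside in case they are needed again) can be sketched in store-agnostic terms. The "text" key and the list-of-dicts layout are assumptions for illustration; the real system presumably runs an equivalent query against its database:

```python
def purge_empty_claims(claims):
    """Split claims into those to keep (non-empty text) and a backup
    list of the removed ones, mirroring the delete-but-archive step.

    `claims` is a list of dicts with a "text" key; all names here are
    illustrative, not the actual Fandango schema.
    """
    kept, removed = [], []
    for claim in claims:
        (kept if claim.get("text", "").strip() else removed).append(claim)
    return kept, removed

claims = [
    {"id": 1, "text": "GDP grew 2% in 2019"},
    {"id": 2, "text": ""},
    {"id": 3, "text": "   "},  # whitespace-only counts as empty
]
kept, removed = purge_empty_claims(claims)
print(len(kept), len(removed))  # 1 2
```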