UniStuttgart-VISUS / damast

Code for the DH project "Dhimmis & Muslims – Analysing Multireligious Spaces in the Medieval Muslim World" (VolkswagenFoundation)
MIT License

"NULL" vs. empty in column "comment" of Place table #92

Closed tutebatti closed 2 years ago

tutebatti commented 2 years ago

I'm currently looking through the Place table, especially to check for major inconsistencies in the column comment, which is part of the Place URI page. On the one hand, there were 3 or 4 rows with line breaks. On the other hand, some rows have nothing in the column comment, some have NULL. Is that normal?

Note: I am editing a csv downloaded via pgAdmin in LibreOffice Calc.
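For reference, this is roughly the check I have in mind, directly in SQL (a rough sketch, assuming the table is place and the column is comment; the three cases are just my guess at what matters here):

-- Sketch: count NULLs, empty strings, and comments containing line breaks
-- (assumption: table "place", column "comment").
SELECT
    count(*) FILTER (WHERE comment IS NULL)      AS null_comments,
    count(*) FILTER (WHERE comment = '')         AS empty_comments,
    count(*) FILTER (WHERE comment ~ E'[\n\r]')  AS comments_with_linebreaks
FROM place;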

tutebatti commented 2 years ago

Also, since I do not need to go over all entries, maybe @mfranke93 could help with the respective SQL query to get only places from the DhiMu data. Thank you!

mfranke93 commented 2 years ago

On the other hand, some rows have nothing in the column comment, some have NULL. Is that normal?

In general, that is a remnant of the data from the old MySQL database and the old GeoDB-Editor implementation. Be glad that the entries with the string "null" have all been replaced by now ;) The frontend code should handle empty strings and a null value the same in all places, so this should not be visible to users.

Also, since I do not need to go over all entries, maybe @mfranke93 could help with the respective SQL query to get only places from the DhiMu data. Thank you!

I'd rather not repeat all the logic going on in the DaRUS export script in a single SQL query. There are a lot of tables involved. But this should do the trick (based on the data export from January 31):

SELECT * FROM place
WHERE id <= 842
AND id NOT IN (277, 301, 418, 519, 711, 744, 745, 746, 747, 748, 749, 750, 751, 752, 753, 754, 755, 756, 757, 758, 759, 760, 761, 762, 763, 764, 771, 783, 786, 797, 811);

tutebatti commented 2 years ago

Thank you.

I see the problem with the SQL queries now; it's not straightforward, because the data itself is not 100% consistent in that respect.

I will discuss with @rpbarczok whether the values "NULL" and "" should still be aligned, if it's not a lot of work. Our data should look clean in the repository as well, for people who will work with it outside of the visualization.
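If we do align them, this is a minimal sketch of what I imagine, assuming we settle on NULL rather than the empty string (the column and table names are the same assumptions as above):

-- Sketch: turn empty (or whitespace-only) comments into NULL.
UPDATE place
SET comment = NULL
WHERE btrim(comment) = '';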

mfranke93 commented 2 years ago

because the data itself is not 100% consistent in that respect

What do you mean by that?

What I meant here:

I'd rather not repeat all the logic going on in the DaRUS export script in a single SQL query.

was that the query would be very large and complex, and I'd rather not spend the time to create and debug it. I did not imply anything about data quality.

tutebatti commented 2 years ago

What I mean is: if you could simply use the equivalent of the "DhiMu" tag in the visualization to exclude all data that is not part of DhiMu, instead of listing individual places, that would be more straightforward. It seems to me that this is due to the two data sets, DhiMu and eOC, being entangled.

mfranke93 commented 2 years ago

I see what you mean. Yes, that would probably work for the visualization side.

I tried it out with this query, and it works:

WITH place_ids AS (
    SELECT DISTINCT P.id AS id
    FROM place P
    JOIN place_instance PI ON PI.place_id = P.id
    JOIN evidence E ON E.place_instance_id = PI.id
    JOIN tag_evidence TE ON E.id = TE.evidence_id
    JOIN tag T ON TE.tag_id = T.id
    WHERE T.tagname = 'DhiMu'
)
SELECT * FROM place WHERE id IN (SELECT id FROM place_ids);

There is one issue, though, and something you and @rpbarczok should discuss: this query comes up with 450 places, while my previous one comes up with ~850. The dump to DaRUS, at the moment, still contains places for which the dump does not contain any evidence, about 400 of them according to these results. It is up to you to decide whether they should stay in there or not. Whatever your decision, let me know in #3.
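If it helps the discussion, the difference can be listed with a sketch like this (reusing the tag-based CTE from above, and the ID cutoff from my earlier query instead of the manual exclusion list):

WITH tagged_place_ids AS (
    SELECT DISTINCT P.id AS id
    FROM place P
    JOIN place_instance PI ON PI.place_id = P.id
    JOIN evidence E ON E.place_instance_id = PI.id
    JOIN tag_evidence TE ON E.id = TE.evidence_id
    JOIN tag T ON TE.tag_id = T.id
    WHERE T.tagname = 'DhiMu'
)
-- Places the ID-based selection includes but that have no DhiMu-tagged evidence.
SELECT * FROM place
WHERE id <= 842
  AND id NOT IN (SELECT id FROM tagged_place_ids);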

tutebatti commented 2 years ago

When going through the data now, we can unify these instances. Although the software treats "" (empty string) and NULL the same, is there any preferred way, @mfranke93?

mfranke93 commented 2 years ago

I would say no content is better represented by null.

tutebatti commented 2 years ago

@rpbarczok We have an answer now. Thinking about it, I'm not sure how to implement this best. As long as only people with technical understanding work with the csv files straight from the db, using null will not cause problems. What do you think?

mfranke93 commented 2 years ago

As long as only people with technical understanding work with the csv files straight from the db, using null will not cause problems.

I have to ask: where would "they" get those CSV files from?

Also, most "sane" database systems would not put the string "null" into CSVs on export, but rather leave it empty. JSON is a different matter, but that has a dedicated null data type.
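PostgreSQL, for instance, writes NULL as an empty, unquoted field by default when exporting CSV; a sketch (the NULL marker is configurable via the NULL option of COPY):

-- Default CSV export: NULL becomes an empty unquoted field,
-- while a non-NULL empty string is written as a quoted "".
COPY (SELECT id, comment FROM place) TO STDOUT WITH (FORMAT csv, HEADER);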

tutebatti commented 2 years ago

Also, most "sane" database systems would not put the string "null" into CSVs on export

That's important information, and I wondered why the CSV files do not show that value. So, when importing a CSV into the DB, will the empty strings from the CSV become NULL again?

where would "they" get those CSV files from?

They would certainly not get them directly. However, formatting/editing on a large scale, as we are doing right now, is done more easily with a CSV file than by editing individual entries via the geo-db. Also, @rpbarczok just prepared a table (.xls?) as a blueprint for a student assistant to put new entries into the DB (via the geo-db). Here the same holds true: repeated entries are more easily handled with a regular spreadsheet GUI. Of course, there needs to be post-processing of the resulting CSV file for a possible import into the DB. The critical operations should and will be carried out by technically skilled persons. At any rate, these are all just thoughts. ;)

mfranke93 commented 2 years ago

That's important information, and I wondered why the CSV files do not show that value. So, when importing a CSV into the DB, will the empty strings from the CSV become NULL again?

Good question. I don't know. CSV is a rather simplistic format, so there is some loss of information regarding data types here on export, and it depends on the database in question what happens on import. If CSV is the "common denominator" for data exchange (not a good idea IMO, but it works for people who prefer doing their editing in Excel etc.), I would just establish that "" and NULL have the same semantics.
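For PostgreSQL specifically, the import side can be pinned down the same way. A sketch of collapsing both variants to NULL on re-import (the file path /tmp/place.csv is just a placeholder; \copy in psql would do the same client-side):

-- In CSV mode, unquoted empty fields are read as NULL by default;
-- FORCE_NULL additionally turns quoted empty strings ("") into NULL.
COPY place (id, comment) FROM '/tmp/place.csv'
WITH (FORMAT csv, HEADER, FORCE_NULL (comment));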

They would certainly not get them directly. However, formatting/editing on a large scale, as we are doing right now, is done more easily with a CSV file than by editing individual entries via the geo-db. Also, @rpbarczok just prepared a table (.xls?) as a blueprint for a student assistant to put new entries into the DB (via the geo-db). Here the same holds true: repeated entries are more easily handled with a regular spreadsheet GUI. Of course, there needs to be post-processing of the resulting CSV file for a possible import into the DB. The critical operations should and will be carried out by technically skilled persons. At any rate, these are all just thoughts. ;)

I see. Well, see above: empty string and null then mean the same thing. I would not recommend putting the string NULL anywhere.

tutebatti commented 2 years ago

Well, see above: empty string and null then mean the same thing. I would not recommend putting the string NULL anywhere.

I will tattoo that on my forehead. Or at least my forearm. ;)

so there is some loss of information regarding data types here on export, and it depends on the database in question what happens on import

I hope this will not get in our way now. As I said (and as I understood your advice for a possible workflow), we are editing data on a large scale right now using CSV files. @rpbarczok wrote the respective SQL queries, preserving the relevant columns with the keys, and we hope you can import this data back into the DB. :grimacing:

mfranke93 commented 2 years ago

I hope this will not get in our way now. As I said (and as I understood your advice for a possible workflow), we are editing data on a large scale right now using CSV files. @rpbarczok wrote the respective SQL queries, preserving the relevant columns with the keys, and we hope you can import this data back into the DB. :grimacing:

I don't think it will. Doing this in bulk is the way to go, and as long as you check the data before you (or I) re-incorporate it into the DB, there should not be any issues.

tutebatti commented 2 years ago

Great. There were some issues with how different GUIs handle the CSV: e.g., LibreOffice Calc does not write "" for the last column if it is empty, but rather writes nothing after the last comma. In general, that's not something we can control. I would consider this "issue" resolved. Thank you!