DDMAL / linkedmusic-datalake


Simssa DB data processing #75

Open Yueqiao12Zhang opened 6 days ago

Yueqiao12Zhang commented 6 days ago

This issue starts with retrieving the Simssa DB data dumps. Can I use the files in https://github.com/ELVIS-Project/simssadb/tree/develop/sample_data_for_SIMSSA_DB? If so, which specific files should be used?

Yueqiao12Zhang commented 6 days ago

In the original folder, there is a leftover SQL dump. I could not find that SQL dump in the Simssa DB GitHub repository, but I did find a few CSVs (they seem to be small snippets). For example, https://github.com/ELVIS-Project/simssadb/blob/develop/sample_data_for_SIMSSA_DB/JLSDD/JLSDD%20(corr%20IL).csv. Which data dumps are the correct ones to use?

fujinaga commented 1 day ago

Because we are just experimenting at this stage, you can use any or all of the available data. You can certainly start with https://github.com/ELVIS-Project/simssadb/blob/develop/sample_data_for_SIMSSA_DB/JLSDD/JLSDD%20(corr%20IL).csv and see what kind of issues are involved.

Yueqiao12Zhang commented 1 day ago

In the JLSDD CSV, there are no IDs for any attribute, including the key attribute. In the SQL dump, every table has at least one ID that connects it to other tables. If I start with this sample, the reconciliation step will differ from what is needed once we move on to the real data set.
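
One possible way to narrow that gap while experimenting is to derive stand-in IDs from the CSV values themselves, so rows can be linked roughly the way ID columns link tables in the SQL dump. The following is only a minimal sketch: the column names (`composer`, `title`, `genre`), the sample rows, and the hashing scheme are all hypothetical and are not taken from the actual JLSDD file or the Simssa DB schema.

```python
import csv
import hashlib
import io

# Hypothetical sample rows; the real JLSDD CSV's columns are not assumed here.
SAMPLE_CSV = """composer,title,genre
Josquin des Prez,Ave Maria,motet
Pierre de la Rue,Missa Ave Maria,mass
Josquin des Prez,Missa Pange lingua,mass
"""


def surrogate_id(prefix, *values):
    """Build a deterministic stand-in ID from normalized values (assumed scheme)."""
    normalized = "|".join(v.strip().lower() for v in values)
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:10]
    return f"{prefix}_{digest}"


def add_surrogate_ids(csv_text):
    """Return the CSV rows as dicts with added 'composer_id' and 'work_id' keys."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The same composer string always maps to the same composer_id.
        row["composer_id"] = surrogate_id("composer", row["composer"])
        # A work is identified here by composer plus title (an assumption).
        row["work_id"] = surrogate_id("work", row["composer"], row["title"])
        rows.append(row)
    return rows


rows = add_surrogate_ids(SAMPLE_CSV)
for row in rows:
    print(row["work_id"], row["composer_id"], row["title"])
```

Because the IDs are hashes of the text values, spelling variants of the same name would still get different IDs, so this does not replace reconciliation against the real IDs in the SQL dump.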