dspinellis / alexandria3k

Local relational access to openly-available publication data sets
GNU General Public License v3.0
79 stars 14 forks source link

Add optional datatype to ColumnMeta #35

Closed BasVerlooy closed 6 months ago

BasVerlooy commented 6 months ago

Adding the data type to ColumnMeta allows for better filtering of a dataset, currently all fields don't have a datatype listed. For Pubmed this means every column is a string, filtering by year for example is now only possible like this year = '2022'. Adding data types to columns of which we know they are integers allows for filtering with a range, like year > 2000

dspinellis commented 6 months ago

Nice idea! I remember this has bit me a number of times in the past, which you can see if you git grep -i Cast. However, now I can't seem to be able to replicate the problem. See the example below from examples/cdindex:

sqlite> .schema works
CREATE TABLE works(
published_year,
id,
published_day,
doi,
published_month
);

sqlite> select * from works where published_year >2010 limit 5;
2011|378||10.7202/1063035ar|
2011|493||10.7202/1063027ar|
2011|585||10.4000/books.apu.8998|
2011|666||10.7202/1063094ar|
2011|694|1|10.18146/tmg.584|6

Adding a data type to an SQLite column, has only a very small effect on its behavior. Could the type only affect the Perl interface? Can you please give me a simple before/after example so I can understand what is going on?

dspinellis commented 6 months ago

You are right! Pubmed requires this fix. Crossref doesn't, because its data are in JSON format, which distinguishes between numbers and strings. In the Pubmed XML everything is a string. Similar fixes are required for all numbers obtained from XML data. Well done and thank you for your patience!

$ bin/a3k query crossref tests/data/crossref-sample/ --query 'SELECT doi FROM works WHERE published_year BETWEEN 2018 AND 2020'  10.46692/9781529202618.006
10.1145/3196398.3196476

$ bin/a3k query pubmed tests/data/pubmed-sample/ --query 'SELECT doi FROM pubmed_articles WHERE pubmed_articles.revised_year BETWEEN 2018 AND 2020'