dspinellis / alexandria3k

Local relational access to openly-available publication data sets
GNU General Public License v3.0

Possible non-random sampling #18

Closed: austinjp closed 11 months ago

austinjp commented 11 months ago

Hi there. Firstly, thanks for a3k, I'm finding it very useful.

I noticed a problem when using --sample 'random.random() < 0.0001' to randomly sample from the latest Crossref dataset. It seemed to produce an identical sample on every run, whereas I was expecting a different sample each time. I've not yet looked through the code, but I wondered if it might be an issue with seeding the random generator? Perhaps this is expected behaviour, so apologies if I missed this in the docs.

An example:

$ a3k populate --sample 'random.random() < 0.0001' /tmp/crossref.db crossref ./data-files/crossref/

$ sqlite3 -batch /tmp/crossref.db 'select id,doi,title from works where id in (select max(id) from works) or id in (select min(id) from works) order by id';
id     doi                          title                                                                                     
-----  ---------------------------  ------------------------------------------------------------------------------------------
0      10.1007/978-3-658-29701-5_1  Keynote Speech Disruption in mobility – new trends, new concepts and new business models?!
21383  10.18356/98a0368f-en-fr      No. 47244 International Bank for Reconstruction and Development and Brazil

$ rm /tmp/crossref.db

$ a3k populate --sample 'random.random() < 0.0001' /tmp/crossref.db crossref ./data-files/crossref/

$ sqlite3 -batch /tmp/crossref.db 'select id,doi,title from works where id in (select max(id) from works) or id in (select min(id) from works) order by id';
id     doi                          title                                                                                     
-----  ---------------------------  ------------------------------------------------------------------------------------------
0      10.1007/978-3-658-29701-5_1  Keynote Speech Disruption in mobility – new trends, new concepts and new business models?!
21383  10.18356/98a0368f-en-fr      No. 47244 International Bank for Reconstruction and Development and Brazil

Notice that the results are identical even after deleting the database and recreating it with a 'fresh' sample. I was expecting a random sample, and hence a different one each time.

Some quick sanity checks:

$ sqlite3 -batch /tmp/crossref.db 'select count(*) from works;'
count(*)
--------
10000

$ ls -l data-files/crossref/ | head -n 4
total 185934MB
-rwxrwxrwx 1 austinjp austinjp  8MB 2023-08-10 19:35 0.json.gz
-rwxrwxrwx 1 austinjp austinjp 11MB 2023-08-10 20:10 10000.json.gz
-rwxrwxrwx 1 austinjp austinjp  7MB 2023-08-10 20:11 10001.json.gz

$ ls -1 data-files/crossref/ | wc -l
28702

Workaround

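As a workaround, I use --sample '( random.seed() ) or random.random() < 0.0001' to re-seed the random generator at every sample decision. Because random.seed() returns None, the expression falls through to the random.random() comparison. It's inefficient, but it gives the results I'd expected: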

$ rm /tmp/crossref.db

$ a3k populate --sample '( random.seed() ) or random.random() < 0.0001' /tmp/crossref.db crossref ./data-files/crossref/

$ sqlite3 -batch /tmp/crossref.db 'select id,doi,title from works where id in (select max(id) from works) or id in (select min(id) from works) order by id';

id     doi                               title                                                                  
-----  --------------------------------  -----------------------------------------------------------------------
0      10.1097/00001721-199206000-00004  Protein S negates the activated protein C inhibitory activity of plasma
37767  10.1177/109980040000200201        Summer Camp for Scientists

$ rm /tmp/crossref.db

$ a3k populate --sample '( random.seed() ) or random.random() < 0.0001' /tmp/crossref.db crossref ./data-files/crossref/

$ sqlite3 -batch /tmp/crossref.db 'select id,doi,title from works where id in (select max(id) from works) or id in (select min(id) from works) order by id';

id     doi                            title                                                                                                                     
-----  -----------------------------  --------------------------------------------------------------------------------------------------------------------------
0      10.15296/ijwhr.2017.33         Health Promoting Behaviors and Self-efficacy of Physical Activity During Pregnancy: An Interventional Study               
21383  10.1016/s0973-0826(08)60299-9  Risk and uncertainty analysis of natural environmental assets threatened by hydropower projects: case study from Sri Lanka
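
For anyone wondering why the workaround expression behaves this way, here is a minimal Python sketch. It is not taken from the a3k code: the `decisions` helper, the constant seed 0, and the use of `compile`/`eval` are assumptions made purely for illustration, and a 0.5 threshold is used so the difference shows up in a handful of draws.

```python
import random

PLAIN = "random.random() < 0.5"
RESEEDING = "( random.seed() ) or random.random() < 0.5"


def decisions(expression, count=64):
    """Evaluate a sample expression `count` times after a constant seed."""
    random.seed(0)  # assumed stand-in for a constant start-up seed
    code = compile(expression, "<sample>", "eval")
    # Expose only the random module to the expression (illustrative choice).
    return [bool(eval(code, {"random": random})) for _ in range(count)]


# random.seed() returns None, so `None or <comparison>` yields the comparison.
print(random.seed() is None)  # True

# Without re-seeding, the constant seed makes every run repeat itself.
print(decisions(PLAIN) == decisions(PLAIN))  # True

# With re-seeding on each evaluation, the constant seed is overridden,
# so two runs almost certainly differ.
print(decisions(RESEEDING) == decisions(RESEEDING))
```

Calling `random.seed()` with no argument re-seeds from the operating system's randomness source (or the current time), which is why each decision costs extra work.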

Best wishes.

dspinellis commented 11 months ago

Thank you for your kind words! The random number generator is indeed seeded with a constant at the beginning of the program's operation so that a3k results are repeatable. The workaround you suggest should help with your process. We could add an option to set the seed (or not set it at all) if you think this is an important feature.
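
The effect of constant seeding can be reproduced outside a3k with a short Python sketch. The `sample_ids` helper and the seed value 42 are hypothetical; the thread does not state which constant a3k uses.

```python
import random


def sample_ids(n_records, probability, seed=None):
    """Return the indices kept by a `random.random() < probability` filter.

    `seed` is a hypothetical parameter: an int gives a repeatable sequence,
    while None seeds from the operating system's randomness source.
    """
    random.seed(seed)
    return [i for i in range(n_records) if random.random() < probability]


# Same constant seed: the two "runs" keep exactly the same records.
first = sample_ids(100_000, 0.001, seed=42)
second = sample_ids(100_000, 0.001, seed=42)
print(first == second)  # True

# No fixed seed: the two runs almost certainly keep different records.
print(sample_ids(100_000, 0.001) == sample_ids(100_000, 0.001))
```

Repeatability of this kind is useful when a sampled database needs to be rebuilt and compared across runs, which matches the rationale given above.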

Looking forward to reading about your results.

austinjp commented 11 months ago

Hi again. No problem! :smiley: I just had a look through the code and yep, I spotted the deterministic seeding. I appreciate that it might be useful, I just wasn't anticipating it. Perhaps the docs could highlight the fact that the sampling is deterministic? I'll send a PR, feel free to use/ignore as you see fit.

My workaround 'works', although it's inefficient. I guess that's not really a problem in reality, since it's plenty fast enough for my needs. A CLI flag for choosing the seed, or for requesting a fresh seed on every invocation, might be good, though, since it would give users more control. But this is more of a feature request than an issue, so I'm happy for this to be closed.
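
One possible shape for such a flag, sketched with Python's standard argparse module. The option names `--seed` and `--fresh-seed`, the default value, and the `sample-demo` program name are all hypothetical and are not part of a3k's actual command line.

```python
import argparse
import random

DEFAULT_SEED = 0  # hypothetical constant; the real value is not given here


def build_parser():
    """Build a parser with two hypothetical seeding options."""
    parser = argparse.ArgumentParser(prog="sample-demo")
    parser.add_argument("--seed", type=int, default=DEFAULT_SEED,
                        help="seed for repeatable sampling")
    parser.add_argument("--fresh-seed", action="store_true",
                        help="seed from the OS so that every run differs")
    return parser


def draw(argv, count=5):
    """Parse argv, seed the generator once, and return `count` draws."""
    args = build_parser().parse_args(argv)
    if args.fresh_seed:
        random.seed()  # fresh seed on every invocation
    else:
        random.seed(args.seed)  # repeatable, user-controllable seed
    return [random.random() for _ in range(count)]


print(draw(["--seed", "7"]) == draw(["--seed", "7"]))  # True
print(draw([]) == draw([]))  # True: the default stays repeatable
```

Seeding once at start-up, rather than inside the sample expression, would avoid the per-record re-seeding cost of the workaround while keeping repeatable runs as the default.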