Open jhpoelen opened 2 months ago
Hi @jhpoelen, I am wondering whether we got the same output from BIF. I have 5,446,329 interaction records, and the run took 5710.24s user 1120.76s system 66% cpu 2:51:39.66 total.
@seltmann nice!
Interesting to see that I got a different result . . . with ~16M interaction records after about 5 hours of processing. 10M records remained after removing duplicate datasets (e.g., datasets indexed by both iDigBio and GBIF).
Can you share a list of unique citations to check whether we ended up using the same datasets?
cat interactions.tsv\
| mlr --tsvlite cut -f citation\
| uniq\
| sort\
| uniq
Hi @jhpoelen There were 3,516,157 unique citations. See https://drive.google.com/file/d/1tYp26ZcfUEVDRMa_KUPzgyvwrfnXswQL/view?usp=drive_link
Hey @seltmann - thanks for sharing your list of unique "referenceCitation" values. I was hoping to compare the list of unique "citation" values instead. That list should be on the order of 100k entries or fewer. Alternatively, to save some back and forth, please share your interactions.tsv, so I can have a look at your source data instead.
PS Note that referenceCitations are the per-claim citations. "Citation" values refer to the dataset under review.
I do realize that the names "referenceCitation" and "citation" are pretty similar and potentially confusing. Perhaps "claimCitation" and "datasetCitation" would be more intuitive. Open to suggestions. . .
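To make the difference between the two columns concrete, here is a minimal sketch that counts the distinct values in a named column of a tab-separated file. The helper name count_unique_column is made up for illustration and is not part of Elton or Miller; it only assumes that interactions.tsv has a header row containing the column names.

```shell
# Hypothetical helper (not part of Elton or Miller):
# count distinct values in a named column of a tab-separated file.
# Usage: count_unique_column <file.tsv> <column-name>
count_unique_column() {
  awk -F'\t' -v col="$2" '
    NR == 1 {
      # locate the requested column in the header row
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
      next
    }
    { print $idx }
  ' "$1" | sort -u | wc -l | tr -d ' '
}
```

Running count_unique_column interactions.tsv citation and count_unique_column interactions.tsv referenceCitation side by side should show a small number of dataset-level citations next to a much larger number of per-claim citations.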
Here is the interactions.tsv https://drive.google.com/file/d/1qbsnK2_3KPCY7foCPSyO9Ujp661caUPt/view?usp=drive_link
I was looking for similar headers as found in the GloBI interactions.tsv file. Dataset citation/claim citation are more intuitive.
@jhpoelen Also, would you be able to share your interactions.tsv output?
@seltmann for sure! Thought you'd never ask ; ) Here's a temporary download link for my interactions.tsv.gz https://send.tresorit.com/a#2wa6zr8XkuayhqmBV4kFeQ . Please confirm receipt.
I ran mlr --tsvlite cut -f citation | sort | uniq
on both of our interactions.tsv files and attached the results.
with
$ wc -l bif-jorrit/citations-uniq.txt
1011 bif-jorrit/citations-uniq.txt
and
$ wc -l bif-katja/citations-uniq.txt
407 bif-katja/citations-uniq.txt
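The line counts show that the two lists differ, but not which dataset citations are missing. A minimal sketch of how to list them, using comm on sorted copies of the two files; the function name citations_only_in is hypothetical:

```shell
# Hypothetical helper: print citations present in the first list
# but absent from the second.
# Usage: citations_only_in <list-a> <list-b>
citations_only_in() {
  a_sorted=$(mktemp)
  b_sorted=$(mktemp)
  # comm requires both inputs sorted in the same collation order
  LC_ALL=C sort -u "$1" > "$a_sorted"
  LC_ALL=C sort -u "$2" > "$b_sorted"
  # -23 suppresses lines unique to the second file and lines common to both
  LC_ALL=C comm -23 "$a_sorted" "$b_sorted"
  rm -f "$a_sorted" "$b_sorted"
}
```

For example, citations_only_in bif-jorrit/citations-uniq.txt bif-katja/citations-uniq.txt would print the dataset citations that appear only in the longer list.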
Hmm. Looks like your run may have been stopped prematurely. Did you perhaps go home after 2.5 hours and close your laptop and stop the process?
Probably a good reason to add some more (automated) info on the origin of the data by documenting which scripts were executed when, and what output they produced. . .
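One possible way to capture that kind of provenance is a small wrapper that logs a start record before a command runs and an end record (exit status plus output hash) after it finishes. This is only a sketch: run_with_provenance, the PROVENANCE_LOG variable, and the provenance.tsv file name are all made up for illustration, and it assumes sha256sum from GNU coreutils is available.

```shell
# Hypothetical wrapper: run a command, write its stdout to a file,
# and append start/end provenance records to a log.
# Usage: run_with_provenance <output-file> <command> [args...]
run_with_provenance() {
  out="$1"; shift
  log="${PROVENANCE_LOG:-provenance.tsv}"
  # the start record is written first, so an interrupted run leaves
  # a "start" line without a matching "end" line
  printf 'start\t%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$out" "$*" >> "$log"
  "$@" > "$out"
  status=$?
  # assumes GNU coreutils sha256sum; other systems may need a different tool
  hash=$(sha256sum "$out" | cut -d' ' -f1)
  printf 'end\t%s\t%s\texit=%s\tsha256=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$out" "$status" "$hash" >> "$log"
  return "$status"
}
```

A "start" line with no matching "end" line in the log would then be a hint that a run was stopped before it completed.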
@jhpoelen received!
By the way, I am trying to restructure Elton to allow for "streaming" reviews. This way, we can see how many potential interaction records are not included due to unknown interaction types. Hoping to make some progress on this . . .
allowing for workflows like:
preston head\
| preston cat\
| elton review\
> review.tsv
to review the latest version of some Preston archive.
After some hacking . . . I was able to create a first pass at streaming reviews with
./find-datasets-with-interactions.sh\
| grep "hash://sha256/9cd053d40ef148e16389982ea16d724063b82567f7ba1799962670fc97876fbf"\
| head -1\
| elton stream --record-type review\
> review.tsv.txt
producing
with the first bunch of records indicating that UCSB-IZC makes . . . biotic interaction claims . . . how about that!
A current review of candidate interaction records can be found in
Seltmann, K., Poelen, J. H., & Augusto Salim, J. (2024). A Review of Candidate Interaction Records in Biodiversity Datasets registered with GBIF and iDigBio as seen via Biodiversity Interaction Finder (BIF) hash://sha256/56ce90e39c2f3a038d577faad70b65573e7ab2e0e69f735f14d71cf4b3b0966b hash://md5/6c497d45da7f79d054d8a5203bd439c0 (0.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.13272096
fyi @zedomel @seltmann
@seltmann says 2M
@jhpoelen says 2.1M