OSC / phylogatr-web

The web app for the Phylogatr Project - https://phylogatr.org/
MIT License

gbif data filtered #27

Open johrstrom opened 2 years ago

johrstrom commented 2 years ago

I'm linking gbif data, and the 1.7 TB file doesn't seem to be the one that gets run through the regex for the id.

I see that some files have `filtered` in their names, but I don't know how they got filtered.

johrstrom commented 2 years ago

This file gets filtered down from 1.7 TB to 2.5 GB, which means we only use about 0.14% of the gbif data.

2.5G /fs/project/PAS1604/gbif/0147211-200613084148143.filtered.txt
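As a quick sanity check on that ratio, here is a minimal sketch. It assumes both sizes are in binary units (as `du -h` reports them); the function name `retained_percent` is only illustrative and not part of the repo.

```python
# Sizes quoted in this thread. Assumption: both are binary units,
# so 1.7 TB is converted to GB with a factor of 1024.
original_gb = 1.7 * 1024
filtered_gb = 2.5


def retained_percent(filtered: float, original: float) -> float:
    """Return the share of the original size that survives filtering, in percent."""
    return 100.0 * filtered / original


percent = retained_percent(filtered_gb, original_gb)
print(f"{percent:.2f}% of the gbif data is kept")  # prints "0.14% of the gbif data is kept"
```

With decimal units instead (a factor of 1000), the result comes out near 0.15%, so either way well under 1% of the download is kept.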

johrstrom commented 2 years ago

The filtering is handled in this file (or a variant of it, since I'm about to rename it).

https://github.com/OSC/phylogatr-web/blob/40504e83491626da3a2279304c7464f6ce21df58/gbif_filter_occurrences.pbs

This ticket now tracks documenting this filtering.