Additional data for TFIDF normalization?

ReinV / SCOPE

Search and Chemical Ontology Plotting Environment

Additional data for TFIDF normalization? #20

Closed by magnuspalmblad 4 years ago

magnuspalmblad commented 4 years ago

The 2010-2019_ChEBI_IDs.tsv works great, and reads in faster than all the original files. However, I think it is more logical to include all years in the default file, from pre-1945 data up to and including the latest 2020 data. Then, as an option, we can allow the user to use a specific range of years for TFIDF normalization. Europe PMC are supposed to update to a more recent ChEBI version soon, and I would like to rerun the searches by year then. Will this data still fit in the GitHub repo?
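A rough sketch of how the optional year range could work, assuming every summary file keeps the `<first year>-<last year>_` prefix seen in `2010-2019_ChEBI_IDs.tsv`. The helper name `files_in_year_range` and the folder argument are hypothetical, not part of SCOPE, and a file for the pre-1945 data would need its own naming rule.

```python
import re
from pathlib import Path

# Assumption: file names start with "<first>-<last>_", as in
# "2010-2019_ChEBI_IDs.tsv". Files named differently are skipped.
YEAR_RANGE = re.compile(r"^(\d{4})-(\d{4})_")


def files_in_year_range(folder, first_year, last_year):
    """Return the .tsv files whose year span overlaps [first_year, last_year]."""
    selected = []
    for path in sorted(Path(folder).glob("*.tsv")):
        match = YEAR_RANGE.match(path.name)
        if match is None:
            continue  # does not follow the assumed naming scheme
        start, end = int(match.group(1)), int(match.group(2))
        # Two closed intervals overlap when each starts before the other ends.
        if start <= last_year and end >= first_year:
            selected.append(path)
    return selected
```

Calling `files_in_year_range("searches_by_year", 1995, 2005)` would then pick up a 1990-1999 file and a 2000-2009 file, but not a 2010-2019 one.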

ReinV commented 4 years ago

> Will this data still fit in the GitHub repo?

Unfortunately, I have reached the 1 GB storage limit for Git LFS, which we had to use for uploading because GitHub limits individual files to 100 MB. Could we link to a different upload site for downloading the searches by year?

We could also split the data into single files that each stay within GitHub's maximum file size, so that we no longer need to upload via Git LFS (and get a higher storage limit).

magnuspalmblad commented 4 years ago

OK, I managed to upload all year-by-year TSV files to a new OSF project (https://osf.io/pvwu2/). I made you administrator as well, so you should be able to change the files and organize them into folders if necessary. The maximum file size is 5 GB, so I think this should be OK for this project. These files (with the PMIDs) are nice to have as a record/reference, but I think SCOPE can use your summary files for the normalization, as we never use the individual articles in SCOPE, and the summary files are read in faster.

ReinV commented 4 years ago

I uploaded the summarized searches-by-decade files. I think I will leave the 2010-2020 summarized file in the SCOPE project as the standard file and link the OSF project in the README, in case people want to use specific decades for TFIDF. What if the script always uses every file in the searches_by_year folder? Then you can add files if you want, or leave it with the standard 2010-2020 option.
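A minimal sketch of the "use every file in the folder" idea, not the actual SCOPE code. It assumes each summary file is tab-separated with a ChEBI ID in the first column and an integer count in the second; the real column layout may differ. The function name `load_counts` is hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path


def load_counts(folder="searches_by_year"):
    """Sum the counts per ChEBI ID over every .tsv file found in `folder`."""
    totals = Counter()
    for path in sorted(Path(folder).glob("*.tsv")):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.reader(handle, delimiter="\t"):
                if len(row) < 2:
                    continue  # blank or malformed line
                # Assumed layout: column 0 = ChEBI ID, column 1 = count.
                try:
                    totals[row[0]] += int(row[1])
                except ValueError:
                    continue  # non-numeric count, e.g. a header line
    return totals
```

With this approach, adding or removing a decade file in the folder changes the normalization data without any change to the script.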

magnuspalmblad commented 4 years ago

Yes, I think this is a good solution - flexible yet not requiring any modification to the scripts, only the input data in the folder.