adsabs / ADSPlanetaryNamesPipeline

Pipeline to identify planetary nomenclature in fulltext of ADS records
MIT License

Export database content #18

Open golnazads opened 2 weeks ago

golnazads commented 2 weeks ago

Export database content: for the time being, the run.py script will output a CSV file that contains:

- Bibcode where the feature was found (e.g. 2000M&PS...35.1043T)
- Target (e.g. Moon)
- Feature type (e.g. Oceanus)
- Feature name (e.g. Oceanus Procellarum)
- Feature id (e.g. 4395): this is found in the source data from the gazetteer and is used to create the link to the online page for the feature (https://planetarynames.wr.usgs.gov/Feature/4395)

As specified in https://docs.google.com/document/d/11TvdloUDrbTXS7YbPg-8YKEzYUeiMv5Xu8esszt1jI8/edit?usp=sharing by Alberto
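
For illustration, a minimal sketch of what such an export could look like. This is not the actual run.py code: the header names in `FIELDS` and the helpers `write_features_csv` and `feature_url` are hypothetical, and only the example row and the URL pattern come from the description above.

```python
import csv
import io

# Hypothetical header names; the issue lists the fields but not how they are spelled.
FIELDS = ["bibcode", "target", "feature_type", "feature_name", "feature_id"]


def write_features_csv(rows, out):
    """Write one CSV row per (bibcode, feature) match to the open text file `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


def feature_url(feature_id):
    """Build the gazetteer page link from a feature id (URL pattern taken from the issue)."""
    return f"https://planetarynames.wr.usgs.gov/Feature/{feature_id}"


# The single example row given in the issue description.
example_rows = [
    {
        "bibcode": "2000M&PS...35.1043T",
        "target": "Moon",
        "feature_type": "Oceanus",
        "feature_name": "Oceanus Procellarum",
        "feature_id": 4395,
    },
]

buffer = io.StringIO()  # stands in for a real output file
write_features_csv(example_rows, buffer)
csv_text = buffer.getvalue()
```

In a real run, `out` would presumably be a file opened with `newline=""`, as the Python `csv` module recommends.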

golnazads commented 2 weeks ago

Last year the export also included the total number of times each feature name appeared across all the papers, for a given feature name/feature type/target. Should it stay, or should I remove it? @aaccomazzi

aaccomazzi commented 2 weeks ago

If I understand this question correctly, the count is superfluous because one can get it by simply counting the number of rows in which a particular feature appears in the CSV file (which is the number of articles where feature X appears). Am I getting this right?
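
As a rough sketch of that counting step, assuming the CSV has hypothetical `bibcode` and `feature_id` header names: the bibcodes `paperA` and `paperB`, the feature id 9999, and its row values are made-up placeholders, not real records.

```python
import csv
import io
from collections import Counter


def articles_per_feature(csv_file):
    """Count the distinct bibcodes in which each feature id appears."""
    reader = csv.DictReader(csv_file)
    # Deduplicate in case the same (feature, paper) pair is listed more than once.
    pairs = {(row["feature_id"], row["bibcode"]) for row in reader}
    return Counter(feature_id for feature_id, _ in pairs)


# Placeholder data: only feature id 4395 and its names come from the issue.
sample = io.StringIO(
    "bibcode,target,feature_type,feature_name,feature_id\n"
    "paperA,Moon,Oceanus,Oceanus Procellarum,4395\n"
    "paperB,Moon,Oceanus,Oceanus Procellarum,4395\n"
    "paperA,Moon,Placeholder type,Placeholder name,9999\n"
)
article_counts = articles_per_feature(sample)
```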

A more interesting metric would be the number of times the feature appears within a given paper. With this number we could compute, if we wanted, TF/IDF for each feature in each paper, which may be a useful metric for retrieval further down the line. Is this something you can easily generate?
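
To make the idea concrete, here is a small sketch of one common TF-IDF variant computed from per-paper counts. Several things here are assumptions rather than anything decided in this thread: term frequency is normalized by the total number of feature mentions in the paper (not by the paper's word count), IDF is the plain `log(N / df)` with no smoothing, and the function name, bibcodes, and feature id 9999 are placeholders.

```python
import math
from collections import defaultdict


def tf_idf(counts):
    """Score each (bibcode, feature_id) pair.

    `counts` maps (bibcode, feature_id) to the number of times the feature
    appears in that paper; all counts are assumed to be positive.
    """
    mentions_in_paper = defaultdict(int)   # total feature mentions per paper
    papers_with_feature = defaultdict(set)  # papers in which each feature occurs
    for (bibcode, feature_id), n in counts.items():
        mentions_in_paper[bibcode] += n
        papers_with_feature[feature_id].add(bibcode)

    n_papers = len(mentions_in_paper)
    scores = {}
    for (bibcode, feature_id), n in counts.items():
        tf = n / mentions_in_paper[bibcode]
        idf = math.log(n_papers / len(papers_with_feature[feature_id]))
        scores[(bibcode, feature_id)] = tf * idf
    return scores


# Placeholder counts for two papers and two features.
counts = {
    ("paperA", "4395"): 3,
    ("paperA", "9999"): 1,
    ("paperB", "4395"): 2,
}
scores = tf_idf(counts)
```

With this variant, a feature that occurs in every paper gets a score of zero, which is the usual TF-IDF behavior for terms that do not discriminate between documents.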

golnazads commented 2 weeks ago

So instead of summing all the instances per feature name and reporting the total, you want to see the counts for each paper individually, so you can sum them later if you want? OK.
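
A sketch of how the overall totals could be recovered later from per-paper counts, assuming the CSV gains a hypothetical `count` column; the column name, the bibcodes, and feature id 9999 are placeholders.

```python
import csv
import io
from collections import Counter


def total_mentions(csv_file):
    """Sum the per-paper counts to get the total mentions of each feature id."""
    totals = Counter()
    for row in csv.DictReader(csv_file):
        totals[row["feature_id"]] += int(row["count"])
    return totals


# Placeholder per-paper counts.
sample = io.StringIO(
    "bibcode,feature_id,count\n"
    "paperA,4395,3\n"
    "paperB,4395,2\n"
    "paperA,9999,1\n"
)
totals = total_mentions(sample)
```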