ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query
MIT License

User-side feedback #183

Open dmitriz opened 4 years ago

dmitriz commented 4 years ago

Opening this issue to document feedback and recommendation from the users' perspectives.

It is ~~2018~~ 2020 and we still talk about papers. 😄

Minimal usage

$ getpapers -q covid
info: Searching using eupmc API
error: No output directory given. You must provide the --outdir argument.

Next simplest choice:

$ getpapers -q covid -o covid
info: Searching using eupmc API
info: Found 37494 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 5.3.2 vs. 6.2 reported by api
Retrieving results [==----------------------------] 8% (eta 232.8s)^C

Smaller searches work nicely, apart from the warnings that are a bit confusing.

$ getpapers -q "covid tracing" -o tracing
info: Searching using eupmc API
info: Found 1155 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 5.3.2 vs. 6.2 reported by api
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Individual EUPMC result metadata records written
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt

Now refining:

$ getpapers -q "covid tracing korea" -o tracing
info: Searching using eupmc API
info: Found 254 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 5.3.2 vs. 6.2 reported by api
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Individual EUPMC result metadata records written
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt

And again:

$ getpapers -q "covid tracing korea taiwan vietnam" -o tracing
info: Searching using eupmc API
info: Found 26 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 5.3.2 vs. 6.2 reported by api
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Individual EUPMC result metadata records written
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt

petermr commented 4 years ago

Thanks. Valuable.

I am NOT the author of getpapers. Rick Smith-Unna is, and we should try to get his views. Here are mine. I think they should be refiled as issues.

1) default directory.

2) infinite download. Yes, this is a major problem. There needs to be an inbuilt limit.

3) cached download. The JSON is (I think) ordered by scientific priority. I don't know if the download order follows this.

4) overwriting and merging. This is an important issue. It's nice that you can download on top of an existing dir/CProject. But there may be implicit context that is lost. It would probably be useful to have a switch --overwrite.

I am having to deal with some of this in ami download https://github.com/petermr/ami3

dmitriz commented 4 years ago

> Thanks. Valuable.

Thank you for your appreciation. :)

> I am NOT the author of getpapers. Rick Smith-Unna is, and we should try to get his views.

Judging by the lack of responses to previous issues and the last code change back in 2016, this may have been off his radar for quite a while.

> default directory. Pros: it's simple. Cons: some queries are a page long. We either truncate or hash.

What about using the search string?
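A minimal sketch of that idea, which also covers the truncate-or-hash concern for long queries. The helper name `default_outdir` and the 40-character cut-off are assumptions for illustration, not anything getpapers provides:

```python
import hashlib
import re

MAX_SLUG_LENGTH = 40  # assumed cut-off, not a getpapers setting


def default_outdir(query: str, max_length: int = MAX_SLUG_LENGTH) -> str:
    """Derive a directory name from a search query (hypothetical helper)."""
    # Lower-case the query and collapse every run of characters that is
    # not an ASCII letter or digit into a single underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", query.lower()).strip("_")
    if not slug:
        slug = "query"
    if len(slug) <= max_length:
        return slug
    # Long queries: keep a readable prefix and append a short hash of the
    # full query so that two different long queries get different names.
    digest = hashlib.sha1(query.encode("utf-8")).hexdigest()[:8]
    return f"{slug[:max_length].rstrip('_')}_{digest}"
```

With this, `default_outdir("covid tracing korea")` gives `covid_tracing_korea`, while a page-long query is cut to a readable prefix plus an 8-character hash suffix.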

> infinite download. Yes, this is a major problem. There needs to be an inbuilt limit.

A limit of 100 results is a common default that I've seen in many APIs. The limit also needs a defined order, maybe the 100 most recent results?
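Roughly what I have in mind, as a sketch only. The record layout (a dict with a `published` date) is made up for the example and is not the EuPMC result schema, and `most_recent` is a hypothetical helper:

```python
from datetime import date

DEFAULT_LIMIT = 100  # assumed default, as suggested above


def most_recent(records, limit=DEFAULT_LIMIT):
    """Return at most `limit` records, newest first (hypothetical helper)."""
    # Sort by the assumed "published" field, newest first, then cut off.
    ordered = sorted(records, key=lambda r: r["published"], reverse=True)
    return ordered[:limit]


# Illustrative records only; the ids and dates are invented.
sample = [
    {"id": "A", "published": date(2020, 3, 1)},
    {"id": "B", "published": date(2020, 5, 20)},
    {"id": "C", "published": date(2019, 12, 31)},
]
newest_two = most_recent(sample, limit=2)
```

Sorting before truncating matters here: cutting the list first would keep whichever records happened to arrive first, not the newest ones.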

> cached download. The JSON is (I think) ordered by scientific priority. I don't know if the download order follows this.

By scientific priority, you mean the first mention? I didn't know the APIs could do such things. :)

> overwriting and merging. This is an important issue. It's nice that you can download on top of an existing dir/CProject. But there may be implicit context that is lost. It would probably be useful to have a switch --overwrite.

Agreed. The most user-friendly way is probably to print an overwrite warning with options to select: yes, no, or yes to all, which skips the remaining warnings.
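A sketch of how such a prompt could behave, assuming a small helper that remembers a "yes to all" answer. `OverwritePolicy` and its prompt text are hypothetical and do not exist in getpapers:

```python
from pathlib import Path


class OverwritePolicy:
    """Remembers a 'yes to all' answer across files (hypothetical helper)."""

    def __init__(self, ask=input):
        self._ask = ask  # injectable so the prompt can be scripted
        self._yes_to_all = False

    def may_write(self, path) -> bool:
        # New files, or an earlier "all" answer, need no prompt.
        if self._yes_to_all or not Path(path).exists():
            return True
        while True:
            answer = self._ask(
                f"{path} exists. Overwrite? [y]es / [n]o / [a]ll: "
            ).strip().lower()
            if answer in ("y", "yes"):
                return True
            if answer in ("n", "no"):
                return False
            if answer in ("a", "all"):
                self._yes_to_all = True
                return True
            # Any other input: ask again.
```

The caller would check `policy.may_write(path)` before saving each result file. Passing the `ask` function in keeps the interactive part replaceable, for example by a fixed answer when a flag such as the proposed --overwrite is given.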

> I am having to deal with some of this in ami download https://github.com/petermr/ami3

Do you still need getpapers then?

petermr commented 4 years ago

Yes, we still need it. There are tutorials out there.