Science-for-Nature-and-People / Midwest-Agriculture-Synthesis

Synthesizing and visualizing the impact of conservation agriculture

Literature Search helper #38

Open nathanhwangbo opened 5 years ago

nathanhwangbo commented 5 years ago

Started work on this in wos_pubsearch.R

Question: Is there a reason these duplicates are in references_for_app.csv? See the Paper_id pairs: (129, 204), (45, 310), (142, 95) for exact matches, and (27, 315), (55, 189), (231, 128) for almost exact matches.

swood-ecology commented 5 years ago

@LesleyAtwood do you have thoughts on why there are duplicates?

@nathanhwangbo it sounds like automatically searching the API--either with rwos or another R tool--might not be feasible for free? It seems like it's pretty straightforward in Python: https://pypi.org/project/wos/. Do you think that's a more robust way to go? We'll only need this to pull a .bib file to match against the initial reference list so we could in theory run the search with Python, export to R and do filtering, etc in R if that made more sense.

LesleyAtwood commented 5 years ago

@swood-ecology & @nathanhwangbo, the 'duplicates' are a product of the four separate reviews. These papers satisfied the criteria for more than one review, meaning they had, for example, both tillage treatments and cover crop treatments.

swood-ecology commented 5 years ago

Should we manage that by updating the literature search for each review separately, or as one big search?

LesleyAtwood commented 5 years ago

For the reviewers' sake it makes more sense to update each review separately, but we will need to figure out a way to match papers across reviews so we don't double count the papers.

LesleyAtwood commented 5 years ago

@nathanhwangbo , to help match references I added a refs_all_expanded.csv to google drive. Paper titles may be the best way to match provided the matching process isn't case sensitive. Also, we had to manually add a majority of the DOIs because they weren't included in the references.

nathanhwangbo commented 5 years ago

Thanks for adding refs_all_expanded. I added it to the repo for easier use.

Sorry if my original post wasn't clear. We will be able to automatically search the api using the r package rwos -- the limitation is that we are only able to access the "lite" version, which only contains 4 out of the 10 editions that Web of Science indexes. The same limitation would apply to the python package wos. However, this doesn't seem to be too much of an issue:

My current process to match the query with the doc is as follows:

  1. query web of science using rwos (using the queries from the doc in the google drive)
  2. match doi (between refs_all_expanded.csv and the query)
  3. for those without a doi match, I match exact titles
  4. for those without a doi match or a title match, I fuzzy-match the titles (using Levenshtein distance)
    • note: a lot of papers had small typos in the titles, so fuzzy-matching did a lot of work. I used a cutoff of distance < 30, which means two titles count as a match if one can be turned into the other with fewer than 30 single-character edits (insertions, deletions, or substitutions). 30 was chosen by increasing the distance until I didn't see any more matches. Alternatively, we can try a higher threshold, but also require that the two matches have the same year.
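For anyone following along, the matching steps above can be sketched roughly as follows. This is Python rather than the R in wos_pubsearch.R, and the record fields (`doi`, `title`, `year`) and the function names are assumptions for illustration, not the repo's actual code.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def _norm(value) -> str:
    # Lowercase and trim so matching is not case sensitive.
    return (value or "").strip().lower()


def match_reference(ref, results, max_distance=30, require_same_year=False):
    """Return (stage, matched_result), or None if no stage finds a match."""
    ref_doi, ref_title = _norm(ref.get("doi")), _norm(ref.get("title"))
    # Step 2: match on DOI.
    if ref_doi:
        for r in results:
            if _norm(r.get("doi")) == ref_doi:
                return "doi", r
    # Step 3: match on exact (case-insensitive) title.
    for r in results:
        if _norm(r.get("title")) == ref_title:
            return "exact_title", r
    # Step 4: closest title with Levenshtein distance below the cutoff.
    best, best_distance = None, max_distance
    for r in results:
        if require_same_year and r.get("year") != ref.get("year"):
            continue
        d = levenshtein(ref_title, _norm(r.get("title")))
        if d < best_distance:
            best, best_distance = r, d
    return ("fuzzy_title", best) if best is not None else None
```

One thing the sketch makes visible: a fixed cutoff of 30 is large relative to a short title, so taking the closest candidate (as done here) or requiring the same year helps avoid pairing unrelated short titles.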

After this process, there are only 3 papers in refs_all_expanded.csv that aren't in our query. Only one of these papers is excluded because we're using the "lite" version of the API. (it's this paper: http://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=17&SID=8FpzdxFs27mqsXgFSaU&page=1&doc=1 )

The other two are indexed by the API, so the problem has something to do with the query, not the fact that we're using the lite API. These are the two papers: http://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=12&SID=8FpzdxFs27mqsXgFSaU&page=1&doc=2 http://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=7&SID=8FpzdxFs27mqsXgFSaU&page=1&doc=7.

To confirm that the query is the issue, I tried passing the query into the web of science website, and was unable to find either of these two papers. I'm wondering if maybe these papers have changed keywords since the last time you guys ran the query.

Questions:

  1. Can you double check that my process for searching the website matches what you guys did? I'm copy/pasting the text from the google doc into the fields. This example is for tillage, where I was looking for this paper: http://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=7&SID=8FpzdxFs27mqsXgFSaU&page=1&doc=7
[image: screenshot of the tillage query entered in the Web of Science search fields]
  2. What's the process for narrowing the query results down to what you guys use in the report? We can expand the query to get these two papers, but it might not be worth it if it makes the filtering step more painful

LesleyAtwood commented 5 years ago

@nathanhwangbo, will you please send me a list of the titles with typos? I'd like to fix those before the database is freely accessible.

LesleyAtwood commented 5 years ago

@nathanhwangbo , I can't access any of the three links you sent because I'm off campus. Can you send me the paper titles?

From the image above, it looks like you searched WoS like we did. I'm not sure why there are two papers excluded from the query list. Once you send me those papers I can investigate.

We narrowed the query results in two stages. First, we reviewed all titles and abstracts and excluded ones that gave any indication that the paper would not match our criteria. If the paper passed the title and abstract review, then we downloaded the entire pdf and read the paper to determine if it matched our criteria. Data were extracted from the papers that met our criteria. It was a long process.

I don't think expanding the query to include the two rogue papers is worth it. There will already be quite a few papers to filter through, we don't want to add to that part of the process.

nathanhwangbo commented 5 years ago

Here's the file with the names I was able to fuzzy-match: https://github.com/Science-for-Nature-and-People/Midwest-Agriculture-Synthesis/blob/master/title_fuzzymatch.csv. I guess calling them typos is a little misleading -- most of them are just small formatting differences.

Here are the paper titles for the three links, in the same order as above: "Changes in water-extractable organic carbon with cover crop planting under continuous corn silage production" (doi: 10.4137/ASWR.S30708)

"Impact of corn residue removal on soil aggregates and particulate organic matter" (doi: 10.1007/s12155-014-9413-0)

"Site-specific nitrogen management of irrigated maize: yield and soil residual nitrate effects" (doi: 10.2136/sssaj2002.0544)

LesleyAtwood commented 5 years ago

Based on the keywords, "Impact of corn residue removal on soil aggregates and particulate organic matter" wouldn't show up in the search. However, it does fit our criteria so we'll keep it.

I'm surprised "Site-specific nitrogen management of irrigated maize: yield and soil residual nitrate effects" doesn't show up in your search because it includes "variable rate application" as a keyword which is one of the Nutrient Management search terms. It also fits our criteria, so we'll keep it in the database.

When you run the fuzzy-match search do you get the same number of "Papers returned from initial search" as I report in the table I sent you?

nathanhwangbo commented 5 years ago

No, they're slightly lower. I originally assumed that was just because we're not looking at the entire web of science library, but maybe something's going on with the query.

For reference, here's how the numbers compare:

| practice | doc | query through R | query through website |
| --- | --- | --- | --- |
| cover crops | 354 | 351 | 363 |
| tillage | 1130 | 1004 | 1029 |
| pest mgmt | 134 | 139 | 140 |
| nutrient mgmt | 738 | 722 | 738 |

LesleyAtwood commented 5 years ago

I just found a new feature within Colandr that calculates the number of unique papers included in the search. While it didn't change much for 3 of the reviews, the # of papers included in the cover crop review matches your number (351).

I think the discrepancy comes down to how I initially searched the papers. If you recall, I accidentally excluded Illinois from my list of states and then had to run a search specifically for Illinois at a later date. When I merged the bib files in Colandr it didn't always remove duplicate papers (possibly due to slight differences in title cases or spaces). The numbers I report in the table are based off the paper counts in Colandr.
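If title case and spacing are what kept the merge from catching duplicates, normalizing titles before comparing them would sidestep that. A minimal sketch, assuming each record is a dict with a `title` field (this is not how Colandr itself deduplicates, and the function names are made up for illustration):

```python
import re
import unicodedata


def title_key(title: str) -> str:
    """Reduce a title to lowercase letters and digits separated by single spaces."""
    text = unicodedata.normalize("NFKD", title)
    # Drop combining accents so "naïve" and "naive" compare equal.
    text = "".join(c for c in text if not unicodedata.combining(c)).casefold()
    return re.sub(r"[^a-z0-9]+", " ", text).strip()


def deduplicate(records):
    """Keep the first record seen for each normalized title."""
    seen = {}
    for record in records:
        seen.setdefault(title_key(record["title"]), record)
    return list(seen.values())
```

Counting `len(deduplicate(records))` on the merged Illinois and non-Illinois exports would give a number that ignores case, punctuation, and spacing differences.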

@Steve, do you think we're okay to move forward when there are a few papers missing in @nathanhwangbo's search? I honestly think it's something with Colandr.

nathanhwangbo commented 5 years ago

On a slightly different note, Julien made a function in the past to find DOIs using article titles (it uses the package rcrossref to do it). Using this function, I was able to correctly find the DOI for 135 out of the 158 references in refs_all_expanded.csv.

| Original Title | Original DOI | Corrected DOI |
| --- | --- | --- |
| Crop rotation and tillage system effects on weed seedbanks | 10.1614/0043-1745(2002)050[0448:CRATSE]2.0CO;2 | 10.1614/0043-1745(2002)050[0448:cratse]2.0.co;2 |
| Long-term tillage and drainage influences on soil organic carbon dynamics, aggregate stability and corn yield | 10.1080/0038768.2013.878643 | 10.1080/00380768.2013.878643 |
| Soil microaggregate and macroaggregate decay over time and soil carbon change as influenced by different tillage systems | 10.2489/jswc.69.9.574 | 10.2489/jswc.69.6.574 |
| Long-term tillage and drainage influences on greenhouse gas fluxes from a poorly drained soil of central Ohio | 10.2489/jswc.69.9.574 | 10.2489/jswc.69.6.553 |
| Tillage and crop rotation impacts on greenhouse gas fluxes from soil at two long-term agronomic experimental sites in Ohio | 10.2489/jswc.69.9.543 | 10.2489/jswc.69.6.543 |
| Tillage and cover cropping effects on soil properties and crop production in Illinois | 0.2134/agronj2016.10.0613 | 10.2134/agronj2016.10.0613 |
| Soil organic carbon changes impacted by crop rotation diversity under no-till farming in South Dakota, USA | 10.2136/sssaj2016.14.0121 | 10.2136/sssaj2016.04.0121 |

(note: refs_all_expanded.csv has 272 unique DOIs. We were able to match 114 of them using the DOI from the automated query. the 158 number comes from 272 - 114)
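Some of the DOI differences can be screened for before any lookup. DOIs are case insensitive, so lowercasing before comparison is safe, and a malformed prefix (like a missing leading "1") can be caught with a shape check. A small sketch, with hypothetical function names; the pattern is a loose approximation of what modern DOIs look like, not an official validator:

```python
import re

DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")
URL_PREFIXES = ("https://doi.org/", "http://doi.org/",
                "https://dx.doi.org/", "http://dx.doi.org/", "doi:")


def normalize_doi(raw: str) -> str:
    """Lowercase a DOI and strip a resolver URL or 'doi:' prefix if present."""
    doi = raw.strip().lower()
    for prefix in URL_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def looks_like_doi(raw: str) -> bool:
    """True if the value has the usual '10.<registrant>/<suffix>' shape."""
    return DOI_SHAPE.match(normalize_doi(raw)) is not None
```

This only catches structural problems: `0.2134/agronj2016.10.0613` fails the shape check, but a well-formed DOI with a wrong digit, such as `10.1080/0038768.2013.878643`, still passes and has to be caught by a title lookup.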

The only concern is the cases where the function links a title to a totally different paper. It'll be hard to catch these when we are looking at new papers. I modified Julien's function to include a tolerance parameter (i.e. letting us choose how closely two titles have to match before we decide they're the same paper), and I've been playing around with trying to find a pretty "safe" parameter value without having to manually search for all the DOIs.
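To illustrate what a tolerance parameter does, here is a sketch in Python using the standard library's `difflib`. It is not Julien's function (which is R and uses rcrossref); the similarity measure, the function names, and the default of 0.9 are all assumptions.

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str) -> float:
    """Similarity between two titles, from 0.0 (nothing shared) to 1.0 (identical)."""
    return SequenceMatcher(None, a.casefold().strip(), b.casefold().strip()).ratio()


def accept_match(query_title: str, candidate_title: str, tolerance: float = 0.9) -> bool:
    """Accept the candidate's DOI only if its title is close enough to the query title."""
    return title_similarity(query_title, candidate_title) >= tolerance
```

Raising `tolerance` toward 1.0 trades missed DOIs for fewer wrong links; candidates that fall below it can be set aside for a manual check instead of being silently accepted.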

swood-ecology commented 5 years ago

@LesleyAtwood and @nathanhwangbo I think moving forward is fine when it's just a few papers. The tillage search looked like it had a pretty big difference (1130 to 1004). Can that be chalked up to the Illinois search issue?

If you think Colandr is the issue, could we use the .bib file that came directly from WoS to match, rather than the .csv generated by Colandr?

swood-ecology commented 5 years ago

Also, @nathanhwangbo I think what matters the most is getting the new search to match the old, whether that's by DOI or title. You shouldn't worry about correcting DOIs (unless it improves search matching) because it's not totally essential I be able to use BibScan to download papers. There will be a small enough number that I can do it manually if the DOI links are problematic.

LesleyAtwood commented 5 years ago

I just ran the search through WoS for Tillage. Here are the results, which are much more similar to what I got back in 2018 than what @nathanhwangbo 's table shows. I just copied and pasted the keywords I sent you into WoS and included or excluded Illinois.

query results

@nathanhwangbo , maybe rerun your queries like I did and we can see if your results match mine. If they don't then double check that you included all the keywords for each topic.

swood-ecology commented 5 years ago

What do you think makes for the difference between when you did it and when @nathanhwangbo did? Something to do with the automated vs manual search? That level of difference seems pretty good to me.

LesleyAtwood commented 5 years ago

@swood-ecology I'm really not sure why our results differ. Clearly the automated search is dropping results, but I'm even more perplexed why his manual search results don't match mine. Hopefully we'll know more once @nathanhwangbo runs the queries again.

nathanhwangbo commented 5 years ago

Found the culprit.

The difference between the two manual search results is a small difference in queries. The first term of the Tillage Specific Query in the doc is conservation till*. The version in the doc (what was originally used) does NOT have quotes around it. My version has quotes (ie "conservation till*").

Should I just stick with the old version? The difference is that the unquoted version is equivalent to conservation AND till*, while the quoted version looks specifically for the phrase(s) conservation till*

So that explains the difference between our WOS manual searches (1112 vs 1029). I double checked, and the difference between my WOS manual search and my WOS automated search (1029 -> 1004) is a direct result of querying using the Lite API (which looks through fewer collections).
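A toy model of the two behaviours, to make the difference concrete. This is not the Web of Science search engine, just a regex approximation written in Python; the function names are made up, and it treats `*` as "any word ending".

```python
import re


def term_pattern(term: str) -> str:
    """Turn a search term such as 'till*' into a whole-word regex."""
    return r"\b" + re.escape(term).replace(r"\*", r"\w*") + r"\b"


def matches_unquoted(query: str, text: str) -> bool:
    """Unquoted: every term must appear somewhere (conservation AND till*)."""
    return all(re.search(term_pattern(t), text, re.IGNORECASE) for t in query.split())


def matches_quoted(query: str, text: str) -> bool:
    """Quoted: the terms must appear next to each other, in order."""
    pattern = r"\s+".join(term_pattern(t) for t in query.split())
    return re.search(pattern, text, re.IGNORECASE) is not None
```

A title like "Soil conservation under reduced tillage" passes the unquoted check but not the quoted one, which is the kind of record that accounts for the larger unquoted count.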

swood-ecology commented 5 years ago

Nice catch! Is it hard to make a list of the papers that were included in the original but not the latter search? My hunch is that we want "conservation till*" and that any papers that weren't caught by "conservation till*" but were grabbed by conservation till* wouldn't have made it through Lesley's final screening process. But I guess it would be good to confirm that.
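The check described here is a set difference followed by an intersection. A minimal sketch, assuming each search result and each final reference carries a `doi` field (in practice titles would be needed as a fallback, since many records lack DOIs); the function name is hypothetical.

```python
def dropped_but_kept(unquoted_results, quoted_results, final_references):
    """Papers found only by the unquoted query that also survived final screening."""
    unquoted = {r["doi"].lower() for r in unquoted_results if r.get("doi")}
    quoted = {r["doi"].lower() for r in quoted_results if r.get("doi")}
    final = {r["doi"].lower() for r in final_references if r.get("doi")}
    only_unquoted = unquoted - quoted       # lost by switching to the quoted query
    return sorted(only_unquoted & final)    # ...and actually used in the database
```

An empty result would confirm the hunch that nothing caught only by the unquoted query made it through screening.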

nathanhwangbo commented 5 years ago

For tillage:

For Early Season Pest Management:

swood-ecology commented 5 years ago

I think we're good on tillage because those two papers that made it into the final reference list are actually there because they're papers about cover crops, not conservation tillage. So they must have also come out in the cover crops search.

nathanhwangbo commented 5 years ago

Ah, I didn't think about that! A quick check shows that you're right: both of the papers also show up in the cover crop query.

I'll go ahead and keep the quoted versions of the queries then

nathanhwangbo commented 5 years ago

I just realized that the Web of Science API doesn't give us a full citation (let alone in Bibtex format), so I'm imagining the following workflow:

  1. Query Web of Science API using rwos package (automated, but user specifies start/end dates)
  2. Pass titles into rcrossref package, to find a match in CrossRef. Separate out matches that aren't exact/super similar (automated)
  3. Do some kind of manual check to see if we can find a match for any of the non-exact matches
  4. Export remaining titles using rcrossref to Bibtex format
  5. Do the rest in BibScan package/colandr? Can we automate any of the filtering process after getting the bib file?

The alternative to using rcrossref is building out a bib file manually from the information from the WoS API. But the WoS API is missing a ton of DOIs, which I think we need for BibScan.
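For reference, building a BibTeX entry by hand from record fields is not much code. The sketch below is Python, and the field names in `record` are assumptions rather than the names the WoS API or rwos actually returns. It also reports which fields are missing, since a record without a DOI is the case that matters here.

```python
BIB_FIELDS = ["author", "title", "journal", "year", "volume", "pages", "doi", "abstract"]


def missing_fields(record: dict) -> list:
    """List the expected fields that are absent or empty in a record."""
    return [name for name in BIB_FIELDS if not record.get(name)]


def to_bibtex(key: str, record: dict) -> str:
    """Format one record as an @article entry, skipping empty fields."""
    lines = ["@article{" + key + ","]
    for name in BIB_FIELDS:
        value = record.get(name)
        if isinstance(value, (list, tuple)):
            value = " and ".join(value)  # BibTeX separates authors with "and"
        if value:
            lines.append("  " + name + " = {" + str(value) + "},")
    lines.append("}")
    return "\n".join(lines)
```

The sketch does not escape special characters such as `&` or `%`, which a real export would need to handle.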

nathanhwangbo commented 5 years ago

Started to implement this workflow in wos_pubsearch_workflow.R.

FYI, there are 289 different papers that show up if we run the queries for 2018-2019.

Question: What's the plan for filtering through the query results? Is Colandr going to be involved?

LesleyAtwood commented 5 years ago

Colandr could be involved, but it's not necessary. The benefit of Colandr is that it helps keep the reviewer organized. I found, however, that the machine learning component of Colandr didn't really speed up the review process.

Because we don't have another option lined up, let's plan to use Colandr. Is there any way we can get both the bib files and pdfs to automatically upload into Colandr? They will have to load by Review so that the filtering criteria don't become too overwhelming.

swood-ecology commented 5 years ago

@nathanhwangbo those 289 papers are ones that didn't show up in the original search that do show up when you do the search to the present day? that's a lot! sounds like I'll have to carve out some time to go through those.

@LesleyAtwood I'm fine with using Colandr for this, but I agree that it really just gives us a home base for going through things, rather than a useful machine learning tool. I'm not sure how you'd auto-upload stuff to Colandr since you can't interact with it using code (as far as I know). I think we'd have to have the script auto-run in the background somewhere, ping us once there were a certain number of papers, then we'd have to take that .bib and upload it manually into Colandr.

LesleyAtwood commented 5 years ago

I agree, 289 papers is a lot. I guess the topics are more popular than ever!

nathanhwangbo commented 5 years ago

Yup, these 289 are completely new papers (it's actually 290 now 😄 )

Out of these 290 papers, I wasn't able to find DOIs for 3 of them (the titles for these three are saved in the matched_title_lower column of wos_cr_nomatch_20191002.csv)

The rest are in citations_20191001.bib.

I'm starting to play with BibScan, but haven't had a high success rate for downloading pdfs yet.

swood-ecology commented 5 years ago

Oh geez! Well I should get on reviewing those soon. @LesleyAtwood do you think we could load those into Colandr now and I could start reviewing?

LesleyAtwood commented 5 years ago

@swood-ecology, Yes, the Colandr reviews are cleared and ready for the next batch of papers. It will probably be easiest to use the same framework I used because the selection criteria are already created and described. I can send you the selection criteria protocol by the end of the day. It's ready to go I just want to read over it again.

swood-ecology commented 5 years ago

Great. Should we go over the first couple together just to make sure I have it right? Maybe on Monday? @nathanhwangbo do you have the searches saved to .bib files that we can load into Colandr? Also, do you think it would be possible to have a cron component to the script that would run it every month and let us know when there were 20 new papers?
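A rough sketch of what the monthly check could look like, in Python. The state file layout, the function names, and the crontab path are assumptions; the actual query would still come from the existing search script, whose results are passed in here as a list of IDs.

```python
import json
from pathlib import Path

# Hypothetical crontab entry: run at 06:00 on the 1st of every month.
# 0 6 1 * * python3 /path/to/check_new_papers.py


def _load(path: Path) -> dict:
    return json.loads(path.read_text()) if path.exists() else {"seen": [], "pending": []}


def update_pending(state_path, fetched_ids, threshold=20):
    """Record newly found paper IDs; return (notify, pending_ids)."""
    path = Path(state_path)
    state = _load(path)
    known = set(state["seen"]) | set(state["pending"])
    # dict.fromkeys drops repeats within this run while keeping order.
    new_ids = [i for i in dict.fromkeys(fetched_ids) if i not in known]
    state["pending"].extend(new_ids)
    path.write_text(json.dumps(state))
    return len(state["pending"]) >= threshold, list(state["pending"])


def mark_reviewed(state_path):
    """After a review round, move all pending IDs to the seen list."""
    path = Path(state_path)
    state = _load(path)
    state["seen"].extend(state["pending"])
    state["pending"] = []
    path.write_text(json.dumps(state))
```

When `update_pending` returns `True`, the script would send whatever reminder is convenient (an email, a GitHub issue comment); that part is left out because it depends on the setup.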

LesleyAtwood commented 5 years ago

@swood-ecology , Tuesday would be better for me. I'm free between 8-11 and after our SOC market meeting

nathanhwangbo commented 5 years ago

I have the .bib files saved in the repo sorted by date of query (see here: https://github.com/Science-for-Nature-and-People/Midwest-Agriculture-Synthesis/tree/master/auto_pubsearch/Bibfiles).

However, I tried to import one of them into Colandr and wasn't having any luck (I would try to import the file and nothing would happen). I'm totally new to Colandr, so it might just be that I'm missing a step. @brunj7 also tried with a different bibfile with the same result. He was able to get the import to work with an .ris file though.

A few other notes:

swood-ecology commented 5 years ago

@nathanhwangbo this paper describes a pretty cool workflow for regularly updating data, like our .bib files.

swood-ecology commented 5 years ago

@nathanhwangbo are the .bib files separated out by literature review? they didn't seem to be in the repo. you'll see that Lesley has 4 reviews, each of which corresponds to its own search criteria. I hope this isn't too complicated, but we'd need different .bib files for each review, rather than an overall .bib.

Screen Shot 2019-10-07 at 8 10 04 AM

nathanhwangbo commented 5 years ago

It's not a problem, but the reason I put all the reviews together was so that I could easily remove papers that are duplicated across reviews. Should I just leave duplicates in there?

swood-ecology commented 5 years ago

I see what you mean. I would leave them separate, though, and keep it how Lesley has done it. We'd have to totally re-do Colandr to be able to read in only one file.
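Keeping one file per review does not rule out tracking the overlap. A sketch in Python, assuming the records are grouped in a dict keyed by review name and each record has a `doi`; the function names and data layout are hypothetical.

```python
from collections import defaultdict


def reviews_by_paper(records_by_review: dict) -> dict:
    """Map each DOI to the set of reviews whose search returned it."""
    index = defaultdict(set)
    for review, records in records_by_review.items():
        for record in records:
            index[record["doi"].lower()].add(review)
    return index


def shared_papers(records_by_review: dict) -> dict:
    """DOIs that appear in more than one review, with the reviews listed."""
    return {doi: sorted(reviews)
            for doi, reviews in reviews_by_paper(records_by_review).items()
            if len(reviews) > 1}
```

The per-review .bib files stay untouched for Colandr, and the `shared_papers` output can be saved alongside them so the same paper isn't counted twice later.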

nathanhwangbo commented 5 years ago

Ok, the files split by review are here (look for the 20191007 files): https://github.com/Science-for-Nature-and-People/Midwest-Agriculture-Synthesis/tree/master/auto_pubsearch/Bibfiles

swood-ecology commented 5 years ago

Thanks @nathanhwangbo. I tried uploading the files to Colandr and was having the same problem with the references not being added. At first I wondered if it was a permissions issue and so tried to upload the .bib to a Colandr review for which I'm the owner (@LesleyAtwood owns the AgEvidence reviews). That didn't work.

Then I was wondering if there might be something different about the .bib files that you're writing from R vs what's downloaded directly from WoS. One thing I noticed is that the .bib files you generated don't have all of the information that the WoS .bib has, which includes things like the full abstract, which is needed for the initial screening.

Do you think this is reconcilable, or do you think we should be thinking about a workflow where perhaps the cron process flags for us when there's a certain number of papers that are new, but then we do the actual search and .bib extraction manually within Web of Science?

swood-ecology commented 5 years ago

I did a search for "cover crops" in Wisconsin since 2017 and I exported it with the options below:

First, select other file type.

Screen Shot 2019-10-08 at 9 02 20 AM

Then, export full reference list as a BibTex file.

Screen Shot 2019-10-08 at 9 02 58 AM

Here's the final reference list (adding as a .zip because GitHub doesn't understand .bib files)

savedrecs.bib.zip

swood-ecology commented 5 years ago

Also, @LesleyAtwood identified some differences between the references your search generated vs the original approach. Here's your reference:

@article{Snapp_2018,
  doi = {10.1016/j.still.2018.02.018},
  url = {https://doi.org/10.1016%2Fj.still.2018.02.018},
  year = 2018,
  month = {aug},
  publisher = {Elsevier {BV}},
  volume = {180},
  pages = {107--115},
  author = {Sieglinde Snapp and Sowmya Surapur},
  title = {Rye cover crop retains nitrogen and doesn't reduce corn yields},
  journal = {Soil and Tillage Research}
}

Here's Lesley's original:

@article{ ISI:000414816800024,
  Author = {Sivarajan, S. and Maharlooei, M. and Bajwa, S. G. and Nowatzki, J.},
  Title = {{Impact of soil compaction due to wheel traffic on corn and soybean growth, development and yield}},
  Journal = {{SOIL \& TILLAGE RESEARCH}},
  Year = {{2018}},
  Volume = {{175}},
  Pages = {{234-243}},
  Month = {{JAN}},
  Abstract = {{As the size and weight of agricultural equipment have increased significantly in the past few decades, the severity and depth of compacted zone may have increased proportionately. Past research indicates that soil compaction affects crop growth and grain yield. Very few studies have been conducted in North Dakota (ND) to understand soil compaction under the current machinery, and its effect on crop growth and yield. The research was conducted on a no-till crop field at Jamestown, ND, USA for 2013 (corn) and 2014 (soybean) growing season. The objective of this study was to evaluate the effect of wheel traffic on soil strength indices and its impact on crop emergence, development and yield. The study also evaluated the effect of winter freezing thawing cycle on soil compaction in the study field.
  The experiment consisted of five soil transects and two traffic conditions based on machinery traffic in the field for both years such as most trafficked (MD rows and least trafficked (LT) rows, laid out in a randomized complete block design with three replicates in strip-plot with space for corn season in 2013, and for soybean season in 2014. Data collected included soil resistance or cone index (CI), soil bulk density, soil moisture content, plant emergence, plant height and grain yield. The results showed that CI values followed a similar pattern for different soil transects up to 37.5 cm depth and then increased sharply. An average CI of 1.19 MPa was noted over the whole profile at 0-45 cm depth for the study area and not significantly different between MT and LT rows for both years. Moderate compaction resulted in early emergence of corn plants in MT rows by 175\% compared to LT rows. The plant height didn't show any significant difference between MT and LT rows for both years. The yield data showed significant difference between the soil transects, but no difference was observed between MT and LT rows in both 2013 and 2014 season. The interactions between soil transects and traffic conditions were not significantly different for all soil and plant related dependent variables. The freeze-thaw cycle occurred during winter from 2013 to 2014 and 2014 to 2015 alleviated soil resistance over the whole soil profile at 0-45 cm depth. Results show that different crops grown in a no till field are not very much influenced by wheel traffic. The study also suggests that moderate compaction occurred after harvest in a no till field could be alleviated by the effect of freeze thaw cycle.}},
  DOI = {{10.1016/j.still.2017.09.001}},
  ISSN = {{0167-1987}},
  EISSN = {{1879-3444}},
  Unique-ID = {{ISI:000414816800024}},
}

nathanhwangbo commented 5 years ago

Thanks for doing that testing.

Can you do similar testing to show how you imported your Wisconsin test file (savedrecs-2.bib) into Colandr? I'm not having any luck, even with this file. Once I figure out how to import a file into Colandr, I'll be able to test what information Colandr minimally needs to accept a reference.

In general, though, we are able to get most of the information in the original references -- the ones we don't have are Abstract and EISSN.
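One way to list exactly which fields differ between a generated entry and a WoS export is to compare field names. A rough Python sketch; it assumes each field starts on its own line, so it is not a full BibTeX parser, and the function name is made up.

```python
import re

FIELD_NAME = re.compile(r"^\s*([A-Za-z][\w-]*)\s*=", re.MULTILINE)


def bibtex_fields(entry: str) -> set:
    """Lowercased field names in one BibTeX entry (one field per line assumed)."""
    return {match.group(1).lower() for match in FIELD_NAME.finditer(entry)}


generated = """@article{Example_2018,
  doi = {10.1000/example},
  title = {An example title},
  year = 2018
}"""

wos_export = """@article{ ISI:000000000000000,
  Title = {{An example title}},
  Year = {{2018}},
  Abstract = {{Example abstract text.}},
  DOI = {{10.1000/example}},
  EISSN = {{0000-0000}},
}"""

# Fields the WoS export has that the generated entry lacks.
print(sorted(bibtex_fields(wos_export) - bibtex_fields(generated)))
```

Both sample entries are invented placeholders; running the same comparison on real entries from the two sources would show the actual gap.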

That being said... if abstracts are required in the .bib file, then it might be better to create a workflow where you guys manually do the query/get .bib files from Web of Science (as you suggested). The Web of Science API doesn't give us abstracts, so I'm trying to grab it from the CrossRef API. While this workflow worked well with DOIs (matching all but 3 papers), it's performing poorly for abstracts (out of the 288 papers in the 2018-2019 queries, CrossRef was only able to find 37 abstracts).

swood-ecology commented 5 years ago

That's too bad you can't get abstracts because those are definitely a must-have. They pop up in Colandr and allow us to screen the papers within Colandr. So maybe we should think about the manual workflow. Do you think there's anything we could automate? Like, doing the search automated through cron and ping us as a reminder when we should do the manual search?

I'm not sure what's up with Colandr not importing those .bib files. Let me quick email the creator and see if she has any idea.