ESHackathon / CiteSource

http://www.eshackathon.org/CiteSource/
GNU General Public License v3.0

Internal data formatting process #16

Closed nealhaddaway closed 2 years ago

nealhaddaway commented 2 years ago

Go from a long data format (stacked results, or full results before deduplication)

to

a wide data format with one record per row and the sources as columns.
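
For instance, something like this - a minimal sketch with made-up record_id / source columns, assuming deduplication has already matched records across sources:

library(tidyverse)

stacked <- tribble(
  ~record_id, ~source,
  1,          "Scopus",
  1,          "WoS",
  2,          "Scopus",
  3,          "PubMed"
)

wide <- stacked %>%
  mutate(present = 1L) %>%
  pivot_wider(names_from = source, values_from = present, values_fill = 0L)

wide
#> # A tibble: 3 x 4
#>   record_id Scopus   WoS PubMed
#>       <dbl>  <int> <int>  <int>
#> 1         1      1      1      0
#> 2         2      1      0      0
#> 3         3      0      0      1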

DrMattG commented 2 years ago

Would we start like this?

library(synthesisr)
#> Warning: package 'synthesisr' was built under R version 4.0.5
library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 4.0.5
#> Warning: package 'ggplot2' was built under R version 4.0.5
#> Warning: package 'tibble' was built under R version 4.0.5
#> Warning: package 'tidyr' was built under R version 4.0.5
#> Warning: package 'readr' was built under R version 4.0.5
#> Warning: package 'dplyr' was built under R version 4.0.5
ris_path="C:/Users/matthew.grainger/Downloads/Embase_290521_N974.ris"
#ris_path="https://gitlab.com/extending-the-earcheck/living-review/-/raw/master/search/literature_search_02/Embase_290521_N974.RIS"

my_ris=synthesisr::read_refs(ris_path)

# adding databases in: duplicate 30% of the records, then assign each record
# a random source label so there is something to deduplicate and compare
my_ris2 = my_ris %>% 
  sample_frac(0.3)
my_ris = my_ris %>% 
  bind_rows(my_ris2)
my_ris$Database = sample(c("DB1", "DB2", "DB3"), nrow(my_ris), replace = TRUE)

my_ris %>% str()
#> 'data.frame':    1267 obs. of  31 variables:
#>  $ language      : chr  "English" "English" "Polish" "English" ...
#>  $ author        : chr  "Zohar, D. and Cohen, A. and Azar, N." "Zeigler, M. C. and Taylor, J. A." "Zawadzka, Bozena" "Zahr, L. K. and de Traversay, J." ...
#>  $ address       : chr  "Dov Zohar, Technion-Israel Inst. Technol., Haifa Israel" "M.C. Zeigler, Nazareth College, 4245 East Avenue, Rochester, NY 14618-3790, United States. E-mail: mczeigle@naz.edu" "B. Zawadzka, Pracowni Promocji Zdrowia Wojewodzkiego Osrodka Medycyny Pracy, Kielcach." "L.K. Zahr, University of California, Los Angeles, School of Nursing 90024-6919, USA." ...
#>  $ year          : chr  "1980" "2001" "2002" "1995" ...
#>  $ title         : chr  "Promoting increased use of ear protectors in noise through information feedback" "The effects of a tinnitus awareness survey on college music majors' hearing conservation behaviors" "Awareness and knowledge as prerequisites for health oriented behavior" "Premature infant responses to noise reduction by earmuffs: effects on behavioral and physiologic measures" ...
#>  $ source        : chr  "Human Factors" "Medical Problems of Performing Artists" "Medycyna pracy" "Journal of perinatology : official journal of the California Perinatal Association" ...
#>  $ volume        : chr  "22" "16" "53" "15" ...
#>  $ issue         : chr  "1" "4" "6" "6" ...
#>  $ start_page    : chr  "69-79" "136-143" "489-493" "448-455" ...
#>  $ abstract      : chr  "Workers in a noisy department of a metal fabrication plant took hearing tests before and at the end of their wo"| __truncated__ "The purpose of the study was to investigate the effects of a tinnitus awareness survey on the hearing conservat"| __truncated__ "The major aim of the Workers' Hearing Protection Program, undertaken by the Provincial Center of Occupational M"| __truncated__ "The continuous high-intensity noise in the neonatal intensive care unit (NICU) is both stressful and harmful fo"| __truncated__ ...
#>  $ keywords      : chr  "auditory system and *ear protection and *industrial hygiene and *noise and prevention and psychological aspect" "adolescent and adult and article and awareness and behavior and controlled study and female and follow up and h"| __truncated__ "adult and article and *attitude to health and female and *health behavior and health promotion and human and *i"| __truncated__ "analysis of variance and article and breathing and case control study and *child behavior and clinical trial an"| __truncated__ ...
#>  $ doi           : chr  "http://dx.doi.org/10.1177/001872088002200108" NA NA NA ...
#>  $ issn          : chr  "0018-7208" "0885-1158" "0465-5893" "0743-8346" ...
#>  $ url           : chr  "http://ovidsp.ovid.com/ovidweb.cgi?T=JS&PAGE=reference&D=emed3&NEWS=N&AN=10041337" "http://ovidsp.ovid.com/ovidweb.cgi?T=JS&PAGE=reference&D=emed7&NEWS=N&AN=33144937" "http://ovidsp.ovid.com/ovidweb.cgi?T=JS&PAGE=reference&D=emed7&NEWS=N&AN=36696570" "http://ovidsp.ovid.com/ovidweb.cgi?T=JS&PAGE=reference&D=emed5&NEWS=N&AN=126224507" ...
#>  $ supertaxa     : chr  "Promoting increased use of ear protectors in noise through information feedback" "The effects of a tinnitus awareness survey on college music majors' hearing conservation behaviors" "Awareness and knowledge as prerequisites for health oriented behavior" "Premature infant responses to noise reduction by earmuffs: effects on behavioral and physiologic measures" ...
#>  $ ID            : chr  "3009" "2816" "2795" "2890" ...
#>  $ ZZ            : chr  "TY  - JOUR" NA NA NA ...
#>  $ source_type   : chr  NA "JOUR" "JOUR" "JOUR" ...
#>  $ RI            : chr  NA NA NA NA ...
#>  $ MP            : chr  NA NA NA NA ...
#>  $ CD            : chr  NA NA NA NA ...
#>  $ RE            : chr  NA NA NA NA ...
#>  $ published_elec: chr  NA NA NA NA ...
#>  $ diseases      : chr  NA NA NA NA ...
#>  $ chemicals     : chr  NA NA NA NA ...
#>  $ BO            : chr  NA NA NA NA ...
#>  $ N2            : chr  NA NA NA NA ...
#>  $ EA            : chr  NA NA NA NA ...
#>  $ GA            : chr  NA NA NA NA ...
#>  $ SV            : chr  NA NA NA NA ...
#>  $ Database      : chr  "DB3" "DB2" "DB2" "DB2" ...

Created on 2022-02-21 by the reprex package (v2.0.1)

nealhaddaway commented 2 years ago

Yeah - great idea. We also need an internal id, which could just be a sequence of integers from 1 to nrow(df), and a new column with the source label.
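
Something like this, perhaps - a sketch only, where cite_id and cite_source are placeholder column names:

add_internal_ids <- function(df, source_label) {
  df$cite_id <- seq_len(nrow(df))   # internal id: 1 to nrow(df)
  df$cite_source <- source_label    # user-supplied source label
  df
}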

nealhaddaway commented 2 years ago

Kaitlyn said she can add columns for each source with the 'id' output from ASySD, so essentially our internal df with a wide source db added (one column per source for each row/record).

chriscpritchard commented 2 years ago

The way we've envisaged the long dataframe so far is the import from synthesisr, plus additional columns prefixed with cs__ (so we don't clobber any user-defined fields in the RIS or BibTeX files):

cs__db - the database
cs__platform - the database platform
cs__searchid - a reference to the search string used
cs__filename - the filename of the RIS file

db, platform and searchid are all optional (filename can then be used for any analysis if they are NA). This comes from the first get-together, where the discussion was that we needed database, platform and some form of search strategy. As the fields are optional we don't have to use them, write any code around them or expose them in the UI, but it does mean we can implement that later if we want.

We'd also have cs__id and cs__dup or similar, with cs__dup being a list of IDs that are duplicates, which we could get by using the code in ASySD.
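
A rough sketch of that import step, assuming synthesisr::read_refs() plus the cs__ columns above (the function name and defaults are illustrative only):

library(synthesisr)

read_source <- function(file, db = NA, platform = NA, searchid = NA) {
  refs <- read_refs(file)
  refs$cs__db <- db                    # optional database name
  refs$cs__platform <- platform        # optional platform
  refs$cs__searchid <- searchid        # optional reference to the search string
  refs$cs__filename <- basename(file)  # always available for analysis
  refs$cs__id <- seq_len(nrow(refs))   # internal record id
  refs
}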

@kaitlynhair - perhaps seeing how the data can flow from CiteSource to ASySD would help with this?

nealhaddaway commented 2 years ago

Thanks, Chris - what is the benefit to adding more columns for a single label?

The way I see it, each file or block of refs has a single label that corresponds to lots of search fields (not just db, platform, search etc.). Isn't it cleaner to have a 'thinner' internal df with a single 'proxy' for the search details that can be resolved later? Otherwise we're adding at least four times as many columns as we need, no?

Maybe I'm missing something, though... What do you need those four columns for?

chriscpritchard commented 2 years ago

Storing things in separate columns means it's easier to manipulate later; if we conflate them into one, it's a lot harder to manipulate and separate out if we decide to extend the project. Currently they can remain unused, though!


nealhaddaway commented 2 years ago

I didn't mean to compress them into one; I meant having a separate set of details storing the search history information. This is in line with the search history data standard that's been developed by the group of search strategy specialists I mentioned last week - it's a single file holding all the relevant search data, something like 30 different fields for each search. We're building that into RIS files, so how would we handle that?

My thinking was to have a single label in our df that references back to this separate info in a single JSON string for each source.
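
Roughly like this - a sketch where the internal df carries only the label and a separate lookup resolves it to the full search-history JSON (the file name and fields here are hypothetical):

library(jsonlite)

# one JSON object per source, keyed by the same label used in the internal df
search_history <- fromJSON("search_history.json", simplifyVector = FALSE)

get_search_details <- function(source_label) {
  search_history[[source_label]]
}

# e.g. get_search_details("Embase_290521")$platform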



chriscpritchard commented 2 years ago

I get you now! Yes, we can do that, but then take the db / search / platform / any other arbitrary key and add it to the JSON, I think.

Let me have a think

nealhaddaway commented 2 years ago

Thanks Chris - the JSON already has search labels unique to the search record given by the user, and a search ID unique to the record from the archiving database. Those should be grand, but any unique ID within the uploaded set would be fine for the operations performed in one instance. Thanks


LukasWallrich commented 2 years ago

I see where you are coming from here, and there should probably be an option to link this to a full-information JSON. However, we should also make it simple to achieve the main use case ... which, as I understand it, is someone who wants to upload a couple of RIS files to get the results table and the related visualisations cut based on some (hierarchical) labels - e.g., platform, database and search id (which might be timepoints, strings, or whatever else the user wants to break out in the reporting).

For that, I would want to be able to upload/import a few RIS files, add (hierarchical) labels and get the desired outputs, without worrying about separate search data. If we are not sure whether platform, database and search id are the most common cuts, we could maybe keep all these labels user-defined?

chriscpritchard commented 2 years ago

I think we can store it internally as JSON, though - we'd just have the ability to create a barebones JSON from user-defined inputs.


LukasWallrich commented 2 years ago

That makes sense - I'm not sure if JSON is a helpful format for internal storage in R ... tibbles/data.frames are certainly more widely supported - but either would work.

nealhaddaway commented 2 years ago

Take a look at the JSON example I sent last week - @chriscpritchard can you forward this to @LukasWallrich, please?

JSON is definitely better than a flat tibble or df.

Let's stick with a single source reference in the internal df that we can look up from another internal reference, whatever that may be, and then we can move on.

kaitlynhair commented 2 years ago

To show you the ASySD workflow I have tweaked, I tried using some RIS files on OSF (see below). I'm having some issues with synthesisr across databases - key metadata ends up in different columns depending on the database - so I had to alter the RIS file from CINAHL a bit.

#load packages
#currently modifying ASySD so sourcing code here
source('C:/Users/kaitl/OneDrive/Desktop/CiteSource/dedup_citations.R')
library(synthesisr)
library(tidyverse)

#load citation files
ris1 <- read_refs("games_pubmed.ris")
ris2 <- read_refs("games_cinahl.ris")

#set db (user input)
db1 <- "PubMed"
db2 <- "CINAHL"

#format ris files
#need to come up with conversion method for some files
#not interoperable code - just for this to work!
ris2 <- ris2 %>% rename(source = JO, doi = DO, volume = VL, issn = SN, title = ID)

ris2$title <- gsub(".*  -", "", ris2$title)
ris2$author <- gsub("and Y1.*", "", ris2$author)

ris1$database <- db1
ris2$database <- db2

#combine citation files
raw_citations <- bind_rows(ris1, ris2)

#deduplicate citations (merge metadata - asysd)
unique_citations <- dedup_citations(raw_citations) 
#Joining, by = "record_id"

unique_citations %>% glimpse()

# Rows: 470
# Columns: 33
# $ duplicate_id       <chr> "1001", "1002", "1003", "1004", "1005", "1006", "1007", "1008", "1009", "...
# $ record_id          <chr> "1001", "1002", "1003", "1004", "1005", "1006", "1007", "1008", "1009", "...
# $ date_generated     <chr> "2015/06//undefined", "2016/10/08/", "2016/03/15/", "2016/06//undefined",...
# $ source_type        <chr> "JOUR", "JOUR", "JOUR", "JOUR", "JOUR", "JOUR", "JOUR", "JOUR", "JOUR", "...
# $ language           <chr> "eng", "eng", "eng", "eng", "eng", "eng", "eng", "eng", "eng", "eng", "en...
# $ author             <chr> "Hale, Lauren and Guan, Stanford", "Hoare, Erin and Milton, Karen and Fos...
# $ year               <chr> "2015", "2016", "2016", "2016", "2018", "2020", "2017", "2017", "2018", "...
# $ title              <chr> "Screen time and sleep among school-aged children and adolescents: a syst...
# $ source             <chr> "Sleep medicine reviews", "The international journal of behavioral nutrit...
# $ source_abbreviated <chr> "Sleep Med Rev", "Int J Behav Nutr Phys Act", "J Affect Disord", "Appl Ph...
# $ volume             <chr> "21", "13", "193", "41", "60", "9", "69", "6", "74", "120", "2015", "34",...
# $ start_page         <chr> "50", "108", "391", "S240", "645", "1", "348", "321", "72", "77", "546925...
# $ end_page           <chr> "58", "", "404", "265", "659", "14", "367", "333", "102", "91", "", "308"...
# $ abstract           <chr> "We systematically examined and updated the scientific literature on the ...
# $ keywords           <chr> "Female and Humans and Male and Adolescent and Cell Phone/statistics & nu...
# $ doi                <chr> "10.1016/j.smrv.2014.07.007", "10.1186/s12966-016-0432-4", "10.1016/j.jad...
# $ issn               <chr> "1532-2955 1087-0792", "1479-5868", "1573-2517 0165-0327", "1715-5320 171...
# $ issue              <chr> "", "1", "", "6 Suppl 3", "7", "1", "4", "3", "", "", "", "4", "3", "3", ...
# $ address            <chr> "", "", "", "", "", "", "", "", "Department of Health and Rehabilitation ...
# $ publisher          <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "...
# $ database           <chr> "PubMed, CINAHL", "PubMed", "PubMed", "PubMed", "PubMed", "PubMed", "PubM...
# $ date_published     <chr> "EBSCOhost", "", "", "", "", "", "", "", "EBSCOhost", "EBSCOhost", "", ""...
# $ DB                 <chr> "cin20", "", "", "", "", "", "", "", "cin20", "cin20", "", "", "", "", ""...
# $ EP                 <chr> "58", "", "", "", "", "", "", "", "102", "91", "", "", "", "", "", "1143"...
# $ JA                 <chr> "SLEEP MED REV", "", "", "", "", "", "", "", "RES DEV DISABIL", "INT J ME...
# $ JF                 <chr> "Sleep Medicine Reviews", "", "", "", "", "", "", "", "Research in Develo...
# $ KW                 <chr> "", "", "", "", "", "", "", "", "Motor Skills Disorders -- Psychosocial F...
# $ PB                 <chr> "Elsevier B.V.", "", "", "", "", "", "", "", "Pergamon Press - An Imprint...
# $ SP                 <chr> "50", "", "", "", "", "", "", "", "72", "77", "", "", "", "", "", "1139",...
# $ TY                 <chr> "JOUR", "", "", "", "", "", "", "", "JOUR", "JOUR", "", "", "", "", "", "...
# $ UR                 <chr> "https://search.ebscohost.com/login.aspx?direct=true&db=cin20&AN=10974604...
# $ CY                 <chr> "New York, New York", "", "", "", "", "", "", "", "", "New York, New York...
# $ AV                 <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "...

#compare citation sources - 1 indicates presence of citation in database, 0 = absence 
data_sources <- source_comparison(unique_citations) 
head(data_sources)
# # A tibble: 6 x 4
# duplicate_id PubMed CINAHL pub_id     
# <chr>        <chr>  <chr>  <chr>      
#   1 1001         1      1      Hale-2015  
# 2 1002         1      0      Hoare-2016 
# 3 1003         1      0      Serati-2016
# 4 1004         1      0      Carson-2016
# 5 1005         1      0      Paulus-2018
# 6 1006         1      0      Kracht-2020
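
For reference, a sketch of how a presence/absence table like this could be derived from the merged database field - not the actual source_comparison() code:

library(dplyr)
library(tidyr)

data_sources_sketch <- unique_citations %>%
  select(duplicate_id, database) %>%
  separate_rows(database, sep = ",\\s*") %>%  # "PubMed, CINAHL" -> two rows
  mutate(present = "1") %>%
  pivot_wider(names_from = database, values_from = present, values_fill = "0")
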
LukasWallrich commented 2 years ago

Looks good. Regarding the field names, I wonder whether we should aim to extend synthesisr's field relabeling (which fails for CINAHL), or import that code into our package and run read_refs with tag_naming = "none"? The latter might be more robust, as we could then report which fields were matched and let the user check and fix issues.

As a step in-between, we could propose that synthesisr returns the relabeling as an attribute to the dataframe, which we can then check and augment? Just relying on the black-box auto-relabeling seems a bit risky.

If we wanted to take over that part of the code, we should start from the detect_lookup() function here?
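
Something along these lines, maybe - a sketch of importing with raw tags and applying a lookup we control (the lookup table here is made up, not synthesisr's internal one):

library(synthesisr)

raw <- read_refs("games_cinahl.ris", tag_naming = "none")

# minimal, user-checkable lookup from RIS tags to internal names
tag_lookup <- c(TI = "title", AU = "author", PY = "year",
                JO = "source", DO = "doi", VL = "volume", SN = "issn")

matched <- intersect(names(raw), names(tag_lookup))
names(raw)[match(matched, names(raw))] <- tag_lookup[matched]

# report anything we could not match so the user can check and fix it
setdiff(names(raw), tag_lookup)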

nealhaddaway commented 2 years ago

Good idea on avoiding black-box labelling - I'm not sure it's been tested broadly.

On the other labelling issue, I realised that we can allow users to provide a single label/proxy for each file/set of records from a different source. We could allow them to enter a DOI for a [searchrxiv.org](https://www.searchrxiv.org) record, and I could build a scraper that automatically pulls in all the search details so we have that as an option. That's another, easier alternative to using embedded RIS files.

chriscpritchard commented 2 years ago

Hi,

So I haven't had loads of time to work on this today, but I plan to do more tomorrow. I'm currently looking to have the following as the function definition: search_json_field can be defined, in which case it will pull the JSON record by record from the RIS file; otherwise it can apply the same JSON to every element, or take a vector of JSON strings the same length as the list of files and apply them in the same order.

In terms of our internal use, I think the best option would be to store each JSON field in the dataframe with a prefix (e.g. csjson__fieldname).

To save us translating from JSON to something directly usable each time we want to do some analysis with that data, we can then spit the JSON back out at the end with toJSON()/fromJSON() (or similar) from jsonlite. Hopefully I can get that done tomorrow.

read_citations <- function(files,
                            search_json = NA,
                            search_json_field = NA,
                            tag_naming = "none")

Edit: in terms of taking a basic user input, it would be easy to have a function that takes our minimum stuff and creates a JSON from it so that's no problem!
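
A sketch of that unpacking step with jsonlite - the helper name is hypothetical and the prefix follows the csjson__ idea above:

library(jsonlite)

add_search_json <- function(refs, search_json) {
  fields <- fromJSON(search_json)  # flat named list of search-history fields
  for (nm in names(fields)) {
    refs[[paste0("csjson__", nm)]] <- fields[[nm]]
  }
  refs
}

# and back out again at the end, e.g.:
# toJSON(refs[1, grepl("^csjson__", names(refs)), drop = FALSE], auto_unbox = TRUE)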

nealhaddaway commented 2 years ago

Thanks, Chris - the JSON should only occur once in each RIS file, or once at the start of each chunk, but searching in each record is fine if it can cope with blanks between records. The embedded record is done on a per-file basis to avoid bloating the files.



LukasWallrich commented 2 years ago

> Edit: in terms of taking a basic user input, it would be easy to have a function that takes our minimum stuff and creates a JSON from it so that's no problem!

Makes sense & thanks for pushing this forward. Just one thought before I drop the idea of more direct labeling: if we want to allow users to simply use the package for the use cases specified, it would be great if they could use the read_citations() function rather than another helper and just pass the labels there - which would only require it to alternatively accept source names and other labels.

That might also make it easier to realise the use case of comparing different stages of the screening process. Currently, I would not know where to specify which citations are initial vs final results - that does not seem to fit into the remit of the JSON (see the last mock-up in #18)

So, I would suggest the following specification:

read_citations <- function(files,
                            search_json = NA,
                            search_json_field = NA,
                            source_names = NA,
                            file_tags = NA,
                            tag_naming = "none")

Where file_tags could be things like "initial search" and "included sources"
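
A hypothetical call with that signature, just to illustrate - the file names and labels are made up:

citations <- read_citations(
  files        = c("scopus.ris", "wos.ris", "wos_included.ris"),
  source_names = c("Scopus", "WoS", "WoS"),
  file_tags    = c("initial search", "initial search", "included sources"),
  tag_naming   = "none"
)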

Let me know if I can help with the implementation.

kaitlynhair commented 2 years ago

@LukasWallrich here is an example of the source comparison table for feeding into the visualisations. Note that it doesn't include the tags yet - I'm unsure whether to make this a separate table (e.g. for comparing search, post-screening and post-data-extraction stages, or different searches from the same source). I could also add tags as an addition to the column name, e.g. Scopus_tag1, Scopus_tag2, or as a separate column. example_source_comparison_table.csv

kaitlynhair commented 2 years ago

Something I maybe didn't make clear before is that ASySD has certain fields required for deduplication. It's OK if some are missing data - blank columns will just be added in. Ideally we should have these fields specified in our input somewhere:

title
author
year
pages
volume
number
journal
doi
isbn
abstract
record_id
label (the same as "tag" - when you upload the file in the shiny app)
source (e.g. "Scopus")

This is the roadblock I've had with the input -> output process, as these will likely need to be user-specified for some sources (if the .RIS file comes directly from the database).
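
A sketch of padding an imported dataframe so every column ASySD expects exists, filling anything missing with NA (field list as above):

asysd_fields <- c("title", "author", "year", "pages", "volume", "number",
                  "journal", "doi", "isbn", "abstract", "record_id",
                  "label", "source")

pad_for_asysd <- function(df) {
  missing_cols <- setdiff(asysd_fields, names(df))
  for (col in missing_cols) df[[col]] <- NA  # add any absent field as blank
  df
}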

LukasWallrich commented 2 years ago

> @LukasWallrich here is an example of the source comparison table for feeding into the visualisations.

Thanks. Could you also share an example of the underlying data - probably best with saveRDS? Then I will make sure that the visualisation functions take that format.

> Note that it doesn't include the tags yet

I would have liked tags as a separate level of aggregation, particularly for the filtering steps - but I think the majority view was that we should work with just one label for now and let users decide whether they want the source to refer to a database, a search string, a stage of the filter, or any combination of these? @nealhaddaway @chriscpritchard could you confirm your understanding of where we stand on this?

nealhaddaway commented 2 years ago

Yes - I agree. We can lookup tags if need be. :)

— Reply to this email directly, view it on GitHubhttps://github.com/ESHackathon/CiteSource/issues/16#issuecomment-1074515402, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AKOBNXFEET27UNDVV6DFNSLVBD7O5ANCNFSM5O6ZNHFQ. You are receiving this because you were mentioned.Message ID: @.***>