ESHackathon / CiteSource

http://www.eshackathon.org/CiteSource/
GNU General Public License v3.0

Shiny: Maximum upload size #136

Closed TNRiley closed 1 year ago

TNRiley commented 1 year ago

I'm hitting the maximum upload size when testing the Shiny app. One of the files I have is just under 90 MB, which is large, but I assume there will be folks with similar-sized files. I'm not sure what size we should allow, or whether there are any memory/processing-power issues to consider. Currently, our max is 30 MB.

LukasWallrich commented 1 year ago

shinyapps.io allocates 1 GB of RAM - I'm not sure how much a 90 MB file takes up once it is read in.

@TNRiley could you import that file in R and then run print(object.size(YOUR_DATA), units = "Mb")? That would give us a sense of when we might run into issues.
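For instance, a minimal sketch of that check, assuming the file is read with synthesisr::read_refs() and with "scopus.ris" standing in for the real file:

library(synthesisr)

# read the RIS export (hypothetical file name) and measure its in-memory size
refs <- read_refs("scopus.ris")
print(object.size(refs), units = "Mb")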

TNRiley commented 1 year ago

Ran a couple just to compare. I have a Dimensions file with 23,373 citations that reads in at 53.7 Mb. I also have a Scopus .ris with 28,748 citations, which reads in at 175.9 Mb!

The Dimensions data has 20 variables; the Scopus one has 433?! Yikes.

I'm going to run the other large .ris files I have and post a table with # of citations, Mb, and # of variables....

TNRiley commented 1 year ago

@kaitlynhair have you seen any .ris read in with a ridiculous number of variables like this?

TNRiley commented 1 year ago

Here is the info on the 5 larger files I have from our recent systematic map:

Dimensions ~23k citations - 53.7 Mb (20 variables)
LENS ~23k citations - 71 Mb (105 variables)
ProQuest ~16k citations - 94.9 Mb (256 variables)
Web of Science ~22k citations - 128.4 Mb (477 variables)
Scopus ~29k citations - 175.9 Mb (433 variables)

I've temporarily saved the RData so that folks can take a look for themselves. It's in the vignettes/troubleshooting folder.

I'm extremely surprised by the number of variables in Scopus and WoS. These were all read in using synthesisr::read_refs().

TNRiley commented 1 year ago

Further analysis of the variables created on import is required. I'll bump the size limit on the Shiny app, but we can review which columns are actually needed once we look at things.

kaitlynhair commented 1 year ago

In global (before the UI), I added this line to raise the max upload size:

options(shiny.maxRequestSize=1000*1024^2, timeout = 40000000)
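For context, a minimal sketch of where that call sits in a single-file app (the fileInput is just illustrative):

library(shiny)

# global scope, before the UI is defined, so it applies to every session
options(shiny.maxRequestSize = 1000*1024^2, timeout = 40000000)  # ~1 GB cap

ui <- fluidPage(
  fileInput("ris_files", "Upload citation files", multiple = TRUE)
)
server <- function(input, output, session) {}
shinyApp(ui, server)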

LukasWallrich commented 1 year ago

You can use this to check for missing data: refsPQ %>% naniar::miss_var_summary() - then you will see that many variables are 99.999% missing, i.e. there is just a single entry for each such field.
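Spelled out with its imports (refsPQ being the imported ProQuest data from above), a version that lists only the near-empty columns:

library(dplyr)
library(naniar)

# summarise missingness per variable; near-100% columns are mostly dead weight
refsPQ %>%
  miss_var_summary() %>%
  filter(pct_miss > 99)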

LukasWallrich commented 1 year ago

> In global (before the UI), I added this line to raise the max upload size:
>
> options(shiny.maxRequestSize=1000*1024^2, timeout = 40000000)

Probably the easiest way to distinguish local and online use would be with

if (interactive()) {
  # local use: Kaitlyn's large limit
  options(shiny.maxRequestSize = 1000*1024^2)
} else {
  # online use (shinyapps.io): a smaller limit, e.g. 250 MB
  options(shiny.maxRequestSize = 250*1024^2)
}

LukasWallrich commented 1 year ago

@kaitlynhair I only just realised that you obviously also have a function to load data in ASySD ... and am now wondering why we would duplicate that functionality here? Any reason why we should not use ASySD to read references? We will likely still run into trouble in edge cases, but then it only needs to be fixed in one place. What do you think?

TNRiley commented 1 year ago

> @kaitlynhair I only just realised that you obviously also have a function to load data in ASySD ... and am now wondering why we would duplicate that functionality here? Any reason why we should not use ASySD to read references? We will likely still run into trouble in edge cases, but then it only needs to be fixed in one place. What do you think?

I think when I looked, ASySD used RefManageR for loading, and I can't remember if that handles .ris - I think just .bib.

LukasWallrich commented 1 year ago

Ok - that's not versatile enough. So let's fix it here - but then the natural home might still be in ASySD, so @kaitlynhair might want to decide whether she wants to move that code over before we submit to CRAN?

LukasWallrich commented 1 year ago

@TNRiley in WoS, I looked for a couple of records with 'odd' fields - could you extract the RIS entries for them? I started with WoS, only to realise that they are exactly the ones where you did not upload the ris.

One other approach would be to have a look at invalid fields - those are probably the entries where things start to go wrong. In refsWoS, there are 114 DOI fields not containing DOIs - e.g., rows 68 and 83. Could you maybe try to import them separately, and if that works, then with as many records in front of them as needed until things go wrong? (To check whether there are DOIs, I used refsWoS %>% mutate(rownum = row_number()) %>% filter(!str_detect(doi, fixed("10."))) %>% pull(rownum).)
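Written out with its imports, that check is (refsWoS being the Web of Science data from the saved RData):

library(dplyr)
library(stringr)

# row numbers of records whose DOI field does not contain "10." - likely mangled on import
refsWoS %>%
  mutate(rownum = row_number()) %>%
  filter(!str_detect(doi, fixed("10."))) %>%
  pull(rownum)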

TNRiley commented 1 year ago

> In global (before the UI), I added this line to raise the max upload size:
>
> options(shiny.maxRequestSize=1000*1024^2, timeout = 40000000)

I saw that the size script was removed from app.R, but didn't see whether you had added it elsewhere. Running options(shiny.maxRequestSize = 1000*1024*1024) locally, I was able to upload all of the .ris files; however, I wasn't able to upload them without doing that first. We could inform users that they need to increase the size limit in this way, but I think keeping the limit large would be fine, considering it worked without issue once I set it locally. Thoughts?

LukasWallrich commented 1 year ago

I now added this in an .onLoad function (14c1f2a) so that you get 1.5 GB locally and 250 MB on shinyapps.io - happy to shift these in whatever direction we think is helpful, but that seems to be the best place for this option to be set.
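For reference, a minimal sketch of what such an .onLoad hook could look like - detecting deployment via interactive() is an assumption here, not necessarily what 14c1f2a does:

.onLoad <- function(libname, pkgname) {
  if (interactive()) {
    options(shiny.maxRequestSize = 1500 * 1024^2)  # ~1.5 GB locally
  } else {
    options(shiny.maxRequestSize = 250 * 1024^2)   # 250 MB on shinyapps.io
  }
}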