MoTrPAC / MotrpacBicQC

R package for the MoTrPAC community
https://motrpac.github.io/MotrpacBicQC/index.html
MIT License

system and stderr/stdout (on Windows) #199

Closed bcjaeger closed 1 year ago

bcjaeger commented 1 year ago

Hello,

Thank you for making this R package, it is very helpful.

I have access to the MoTrPAC Data Hub and have also installed gsutil. However, I run into an issue when I try to use dl_read_gcp() locally on Windows.

Strangely, it runs fine in reprex() (I found this out when I tried to make a reprex for this issue), but when I run it in a local R session in RStudio I get something like this:

library(MotrpacBicQC)

fpath <- file.path("gs:/", 
                   "motrpac-data-hub", 
                   "human-precovid", 
                   "results", 
                   "transcriptomics",
                   'qa-qc')

fpath_file <- file.path(
  fpath, 
  'motrpac_human-precovid_transcript-rna-seq_qa-qc-metrics.csv'
)

tmp <- dl_read_gcp(path = fpath_file,
                   sep = ',',
                   tmpdir = "D:/hap-p-sed-modeling/data/sensitive")

Warning message:
In dl_read_gcp(path = fpath_file, sep = ",", tmpdir = "D:/hap-p-sed-modeling/data/sensitive") :
  gsutil file gs://motrpac-data-hub/human-precovid/results/transcriptomics/qa-qc/motrpac_human-precovid_transcript-rna-seq_qa-qc-metrics.csv does not exist.

@sawyerWeld, @joerigdon, @cgsimmons822, @SHChen0, and @fchsu6 have helped me troubleshoot this. It's possible that something is going wrong with system() that could be fixed by using system2(), but we haven't figured out what yet.

We did find that the dl_read_gcp() function runs locally for me when I set either ignore.stdout OR ignore.stderr to FALSE in the system call here:

system(cmd,  ignore.stdout = TRUE, ignore.stderr = TRUE)

But I have no idea why that fixes the issue.
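One way to see what the child process actually reports, instead of discarding it, is to call system2() and capture both streams. Below is a minimal sketch; run_and_capture() is a hypothetical helper name, not part of MotrpacBicQC, and the sketch does not explain why the ignore flags matter on Windows. It only makes the output and exit status visible for debugging.

```r
# Hypothetical helper (not part of MotrpacBicQC): run a command with
# system2() and keep its output and exit status instead of discarding them.
run_and_capture <- function(command, args = character()) {
  # stdout = TRUE / stderr = TRUE return the output as a character vector.
  # A non-zero exit is reported as a warning plus a "status" attribute.
  out <- suppressWarnings(
    system2(command, args = args, stdout = TRUE, stderr = TRUE)
  )
  status <- attr(out, "status")
  list(
    status = if (is.null(status)) 0L else as.integer(status),
    output = as.character(out)
  )
}

# Usage with a command that exists wherever R does: Rscript itself.
rscript <- file.path(R.home("bin"), "Rscript")
res <- run_and_capture(rscript, c("-e", shQuote("cat('hello')")))
res$status
res$output
```

The same helper could be pointed at gsutil (for example with the arguments "cp", the gs:// path, and the target directory) to check whether the copy itself fails or only the reporting does.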

biodavidjm commented 1 year ago

Hi @bcjaeger,

I have two questions:

  1. How do you run the command gsutil cp gs://motrpac-data-hub/human-precovid/results/transcriptomics/qa-qc/motrpac_human-precovid_transcript-rna-seq_qa-qc-metrics.csv . locally on Windows? That is, how would you copy that file on your Windows system using gsutil?

  2. Have you tried the following command? If so, does it work?

tmp <- dl_read_gcp(path = "gs://motrpac-data-hub/human-precovid/results/transcriptomics/qa-qc/motrpac_human-precovid_transcript-rna-seq_qa-qc-metrics.csv",
                   tmpdir = "D:/hap-p-sed-modeling/data/sensitive")

bcjaeger commented 1 year ago

Thanks!

> How do you run the command gsutil cp gs://motrpac-data-hub/human-precovid/results/transcriptomics/qa-qc/motrpac_human-precovid_transcript-rna-seq_qa-qc-metrics.csv . locally on Windows? That is, how would you copy that file on your Windows system using gsutil?

If not using R, I would use the command prompt:

image

If using R, just calling system() with the corresponding terminal code works fine.

> Have you tried the following command? If so, does it work?

Thanks for checking. I get the same two outcomes with this code. It runs fine in a reprex():

library(MotrpacBicQC)

tmp <- dl_read_gcp(path = "gs://motrpac-data-hub/human-precovid/results/transcriptomics/qa-qc/motrpac_human-precovid_transcript-rna-seq_qa-qc-metrics.csv",
                   tmpdir = "D:/hap-p-sed-modeling/data/sensitive")
#> Warning in system(sprintf("mkdir -p %s", tmpdir)): 'mkdir' not found

Created on 2023-05-08 with reprex v2.0.2

The warning about mkdir is just a Windows thing, I think, and doesn't cause any problems since the directory I'm specifying already exists. But here is the funny part: it doesn't work if I run it in my local R session.

image

Sorry this issue is so strange. I am afraid it is more of a Windows issue than it is a MotrpacBicQC issue.

araskind commented 1 year ago

It is pretty straightforward to upload the data from the command line (DOS window) if you have gsutil installed. The command for uploading a whole folder recursively looks something like this:

gsutil -m cp -r "Y:\DataAnalysis_Reports\EX00979 - PASS 1B\PASS1B-06*" "gs://motrpac-portal-transfer-michigan/PASS1B-06"

To download you just switch the source and destination

bcjaeger commented 1 year ago

> To download you just switch the source and destination

Thank you! I am able to download the files I need, which is great. Even though I can get those files, it may be helpful to update dl_read_gcp() so that it works consistently on Windows and continues to work on other systems too. (I am trying to research why system() has unexpected errors on Windows.)

biodavidjm commented 1 year ago

Thanks @bcjaeger

Question 1 was about how you do it outside of R. Does this command work on your side?

gsutil cp gs://motrpac-data-hub/human-precovid/results/transcriptomics/qa-qc/motrpac_human-precovid_transcript-rna-seq_qa-qc-metrics.csv D:/hap-p-sed-modeling/data/sensitive

Please confirm (and make sure you don't already have the motrpac_human-precovid_transcript-rna-seq_qa-qc-metrics.csv file in that directory).

However, after seeing this warning that you provided in your response:

Warning in system(sprintf("mkdir -p %s", tmpdir)): 'mkdir' not found

I am afraid that could be the issue. For gsutil we have no option other than calling system() in R. However, we should use R's dir.create() function to create the directory, so the OS won't matter.

So please confirm the answer to the first question.

bcjaeger commented 1 year ago

Thanks - confirmed.

image

> I am afraid that could be the issue. For gsutil we have no option other than calling system() in R. However, we should use R's dir.create() function to create the directory, so the OS won't matter.

mkdir could be an issue if the user was hoping to create the directory where they wanted to download data, and using dir.create() seems like a great idea.
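For reference, here is a minimal sketch of what the dir.create() approach could look like. ensure_dir() is a hypothetical helper name used only for illustration; recursive = TRUE gives roughly the behaviour of mkdir -p without relying on a shell command that Windows lacks.

```r
# Hypothetical helper: create a directory (and any missing parents)
# without shelling out, so it behaves the same on every OS.
ensure_dir <- function(dir) {
  if (!dir.exists(dir)) {
    # recursive = TRUE also creates missing parent directories
    dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  }
  # Report whether the directory exists after the attempt
  dir.exists(dir)
}

# Usage: a nested directory under the session's temporary folder
target <- file.path(tempdir(), "motrpac-demo", "qa-qc")
ensure_dir(target)
ensure_dir(target)  # calling it again on an existing directory is harmless
```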

I have a fix that should only change how this function works on Windows. It takes the ignore arguments and sets them to FALSE if the operating system is Windows. I think the issue I'm having with dl_read_gcp() is hard to explain - I would be happy to share screen over zoom to clarify if that's helpful.

dl_read_gcp <- 
  function (path, sep = "\t", header = TRUE, tmpdir = "/tmp", 
            gsutil_path = "gsutil", check_first = TRUE, ...){

  # additions from bcjaeger:
  sys_name <- Sys.info()['sysname']
  ignore_stdout <- ignore_stderr <- sys_name != "Windows"

  system(sprintf("mkdir -p %s", tmpdir))
  new_path <- sprintf("%s/%s", tmpdir, basename(path))
  if (check_first) {
    if (!file.exists(new_path)) {
      cmd <- sprintf("%s cp %s %s", gsutil_path, 
                     path, tmpdir)
      system(cmd, 
             ignore.stdout = ignore_stdout, 
             ignore.stderr = ignore_stderr)
    }
    else {
      message(paste("The file", new_path, "already exists"))
    }
  }
  else {
    message(paste("Downloading file from GCP: ", basename(path)))
    cmd <- sprintf("%s cp %s %s", gsutil_path, path, 
                   tmpdir)
    system(cmd, 
           ignore.stdout = ignore_stdout, 
           ignore.stderr = ignore_stderr)
  }
  if (file.exists(new_path)) {
    dt <- data.table::fread(new_path, sep = sep, header = header, 
                            ...)
    return(dt)
  }
  warning(sprintf("gsutil file %s does not exist.\n", 
                  path))
  return()
  }

biodavidjm commented 1 year ago

Thanks, I'll get back to you with my proposed fix

bcjaeger commented 1 year ago

If you'd like, I can propose a fix in a PR. It could include my Windows code above and the use of dir.create() as you noted. I don't want to take up too much of your time with my weird OS problems =]
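To make the idea concrete, here is a rough sketch of how the two changes might fit together. It is not the code that ended up in the package: fetch_gcs_file() and its runner argument are hypothetical names, the bucket path in the usage example is made up, and the sketch only downloads the file and returns its local path rather than reading it. Passing the runner in as an argument is just a convenience so the logic can be tried without gsutil installed.

```r
# Sketch only, under the assumptions stated above.
fetch_gcs_file <- function(path, tmpdir = tempdir(), gsutil_path = "gsutil",
                           runner = system2) {
  # Portable replacement for `mkdir -p`
  dir.create(tmpdir, recursive = TRUE, showWarnings = FALSE)
  new_path <- file.path(tmpdir, basename(path))

  if (!file.exists(new_path)) {
    # Quote both paths so spaces in directory names survive the shell
    status <- runner(gsutil_path, c("cp", shQuote(path), shQuote(tmpdir)))
    if (!identical(as.integer(status), 0L)) {
      warning(sprintf("gsutil exited with status %s for %s", status, path))
    }
  }

  if (!file.exists(new_path)) {
    warning(sprintf("gsutil file %s does not exist.", path))
    return(invisible(NULL))
  }
  new_path
}

# Usage with a stand-in runner that pretends the copy succeeded
demo_dir <- file.path(tempdir(), "fetch-demo")
fake_runner <- function(command, args) {
  file.create(file.path(demo_dir, "demo.csv"))
  0L  # exit status that a successful command would report
}
local_file <- fetch_gcs_file("gs://example-bucket/demo.csv",
                             tmpdir = demo_dir, runner = fake_runner)
basename(local_file)
```

With the default runner, system2() returns the exit status of the command, so a failed copy is reported separately from a missing file. The returned path could then be handed to data.table::fread(), as dl_read_gcp() does.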

bcjaeger commented 1 year ago

Fixed with #202 and #203 🎉