bzhanglab / OmicsEV

A tool for large scale omics datasets evaluation
https://bzhanglab.github.io/OmicsEV/

Error with OmicsEV and port 22? #5

Closed toddcreasy closed 1 year ago

toddcreasy commented 1 year ago

Hi,

I'm getting an error that I'm having trouble debugging when running OmicsEV in the provided container. At the very bottom of the output, it tries to connect to a host on port 22 and times out. Is there something that I'm supposed to configure?

Registered S3 method overwritten by 'caret':
  method      from
  print.plsda DiscriMiner

species: human
Wed Feb 1 16:13:28 2023: import data ...

Reset data!
Convert <=0 to NA:128338
Data type: gene
Total peaks: 18845
Remove peaks which the percent is more than 0.5 with intensity are NA!
1429
Peaks left: 17416
Save the removed peaks to file: .//metaX-filterPeaks
file: /root/workspace/OmicsEV/example_data/datasets//d1.tsv , total features: 18845 50% missing: 17416
Reset data!
Convert <=0 to NA:0
Data type: gene
Total peaks: 18845
Remove peaks which the percent is more than 0.5 with intensity are NA!
1429
Peaks left: 17416
Save the removed peaks to file: .//metaX-filterPeaks
file: /root/workspace/OmicsEV/example_data/datasets//d2.tsv , total features: 18845 50% missing: 17416
Reset data!
Convert <=0 to NA:0
Data type: gene
Total peaks: 18845
Remove peaks which the percent is more than 0.5 with intensity are NA!
1429
Peaks left: 17416
Save the removed peaks to file: .//metaX-filterPeaks
file: /root/workspace/OmicsEV/example_data/datasets//d3.tsv , total features: 18845 50% missing: 17416
Reset data!
Convert <=0 to NA:128338
Data type: gene
Total peaks: 18845
Remove peaks which the percent is more than 0.5 with intensity are NA!
1429
Peaks left: 17416
Save the removed peaks to file: .//metaX-filterPeaks
file: /root/workspace/OmicsEV/example_data/datasets//d4.tsv , total features: 18845 50% missing: 17416
Reset data!
Convert <=0 to NA:0
Data type: gene
Total peaks: 18845
Remove peaks which the percent is more than 0.5 with intensity are NA!
1429
Peaks left: 17416
Save the removed peaks to file: .//metaX-filterPeaks
file: /root/workspace/OmicsEV/example_data/datasets//d5.tsv , total features: 18845 50% missing: 17416
Reset data!
Convert <=0 to NA:128338
Data type: gene
Total peaks: 18845
Remove peaks which the percent is more than 0.5 with intensity are NA!
1429
Peaks left: 17416
Save the removed peaks to file: .//metaX-filterPeaks
file: /root/workspace/OmicsEV/example_data/datasets//d6.tsv , total features: 18845 50% missing: 17416
Reset data!
Convert <=0 to NA:0
Data type: protein
Total peaks: 9733
Remove peaks which the percent is more than 0.5 with intensity are NA!
334
Peaks left: 9399
Save the removed peaks to file: .//metaX-filterPeaks
file: /root/workspace/OmicsEV/example_data/protein.tsv , total features: 9733 50% missing: 9399
Wed Feb 1 16:14:21 2023: calculate basic metrics for each dataset ...

── Column specification ────────────────────────────────────────────────────────
cols(
  class = col_character(),
  col = col_character()
)

Use cpu: 6
ssh: connect to host 0.0.0.6 port 22: Connection timed out

toddcreasy commented 1 year ago

I should also point out I'm trying to run OmicsEV via an Rscript that simply takes in some arguments and calls run_omics_evaluation()

./bin/run_OmicsEV.R --data_dir=example_data/datasets/ --sample_list example_data/sample_list.tsv --x2 example_data/protein.tsv --x2_label Protein --cpu 6 --use_existing_data TRUE --data_type gene --class_for_ml example_data/sample_ml.tsv --out_dir=.

wenbostar commented 1 year ago

Did you run this on a remote server?

toddcreasy commented 1 year ago

I'm running it on an EC2 instance inside the Docker container. Proxies are set properly. I thought the port might not be open, so I tried docker run with the "-p 22" option, but that didn't help. I'm wondering what the difference would be between launching this from an Rscript and running it within an R session (which works).

I'm playing around with this using nextflow so maybe there's something I'm missing. Here is my fork:

https://github.com/toddcreasy/OmicsEV

toddcreasy commented 1 year ago

Hi @wenbostar, I was wondering if you had any thoughts about my issue. Basically, the error only occurs when OmicsEV is called from an Rscript. In an R session, it works just fine. I understand this is somewhat out of your hands, as it's an outside use case, but if you have any thoughts, let me know!

wenbostar commented 1 year ago

I almost always run OmicsEV through Rscript and I have not encountered this issue, so I'm not sure what causes it. I never set the -p parameter when I run the OmicsEV docker on an EC2 instance. Could you send me the command line you use to launch the OmicsEV docker, as well as the command line to run your R script? I can check whether I can reproduce the issue on my side.

toddcreasy commented 1 year ago

docker run -it -v /root/workspace/OmicsEV/:/opt/ proteomics/omicsev

I created an Rscript like this:

library(docopt)

args <- docopt(doc, version = 'Run OmicsEV v1.0')
args$use_existing_data <- as.logical(toupper(args$use_existing_data))

print("ARGS:")
print(args)

library(OmicsEV)
run_omics_evaluation(data_dir = args$data_dir,
                     sample_list = args$sample_list,
                     x2 = args$x2,
                     x2_label = args$x2_label,
                     cpu = args$cpu,
                     use_existing_data = args$use_existing_data,
                     data_type = args$data_type,
                     class_for_ml = args$class_for_ml,
                     out_dir = args$out_dir)

I run it like this:

root@954c7a599043:/opt# ./bin/run_OmicsEV.R --data_dir=example_data/datasets/ --sample_list example_data/sample_list.tsv --x2 example_data/protein.tsv --x2_label Protein --cpu 6 --use_existing_data TRUE --data_type gene --class_for_ml example_data/sample_ml.tsv --out_dir=.

And the output looks like what I posted originally. It gets to the line that prints "Use cpu: 6" then throws that error and just hangs. No R process is running in the background. It's almost like it's waiting for something?

When I kill it with Ctrl-C, I see this:

^C
Warning message:
Removed 770028 rows containing non-finite values (stat_density).
Execution halted

When I run it within an R session, everything runs smoothly.

Thanks for spending time on this!

toddcreasy commented 1 year ago

Just a quick thought. Do you know what code runs right after the output shown above (before the error and hang)? The reason I ask is that when I run it within an R session (which works), I get a lot of calls to Bioconductor. See further below. I feel like there has got to be some kind of network issue, but I'm struggling to debug it.

Peaks left: 9399
Save the removed peaks to file: .//metaX-filterPeaks
file: /root/workspace/OmicsEV/example_data/protein.tsv , total features: 9733 50% missing: 9399
Wed Feb 1 16:14:21 2023: calculate basic metrics for each dataset ...

── Column specification ────────────────────────────────────────────────────────
cols(
class = col_character(),
col = col_character()
)

Use cpu: 6

--- what is called here? ---

This is from within the R session:

── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
  class = col_character(),
  col = col_character()
)

Use cpu: 6
Bioconductor version '3.12' is out-of-date; the current release version '3.16'
  is available with R version '4.2'; see https://bioconductor.org/install
Bioconductor version '3.12' is out-of-date; the current release version '3.16'
  is available with R version '4.2'; see https://bioconductor.org/install
**.... this is repeated like 40 times ...**
Batch number: 2
Wed Feb  8 18:41:50 2023: batch effect evaluation ...

Missing value imputation ...
Total peaks: 18845
Remove peaks which the percent is more than 0.5 with intensity are NA!
.....

iblacksand commented 1 year ago

Just adding some links/suggestions that may be helpful to diagnose the issue.

The fact that Rscript behaves differently from the R terminal suggests it may have something to do with the R global environment. Rscript starts with a fresh environment every time, while an R terminal session will typically inherit the environment from a previous run, which may explain the difference. I'm not sure why this would cause an SSH error, though. It might be the working directory that is set, or some default values from your R environment.
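One way to compare the two environments is to print the same set of facts from both and diff the output. This is only a rough sketch using base R; the function name `session_facts` is made up for illustration, and the list of things to check is a guess at what could differ, not something known to matter for OmicsEV.

```r
# Gather facts about the current R process that can differ between
# Rscript and an interactive session. Base R only; nothing here is
# specific to OmicsEV. `session_facts` is a hypothetical helper name.
session_facts <- function() {
  c(
    interactive          = as.character(interactive()),
    working_dir          = getwd(),
    lib_paths            = paste(.libPaths(), collapse = ";"),
    attached             = paste(search(), collapse = ";"),
    saved_workspace_here = as.character(file.exists(".RData")),
    command_line         = paste(commandArgs(), collapse = " ")
  )
}

facts <- session_facts()
cat(paste0(names(facts), ": ", facts), sep = "\n")
```

Running this once via Rscript and once by pasting it into the R session gives two listings that can be compared line by line.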

It might also be a Docker issue, as described in this Stack Overflow answer: https://stackoverflow.com/a/71527234.

If none of this works and you would like to be able to run the script from command line you could always use

R -e "library(OmicsEV)
run_omics_evaluation(data_dir = 'example_data/datasets',
sample_list = 'example_data/sample_list.tsv',
x2 = 'example_data/protein.tsv',
x2_label = 'Protein',
cpu = 6,
use_existing_data = TRUE,
data_type = 'gene',
class_for_ml = 'example_data/sample_ml.tsv',
out_dir = getwd())"

which would run the script in the R console. Wrapping this in your existing command script, you could do

library(docopt)

args <- docopt(doc, version = 'Run OmicsEV v1.0')
args$use_existing_data <- as.logical(toupper(args$use_existing_data))

print("ARGS:")
print(args)

command <- "R -e \"library(OmicsEV)
run_omics_evaluation(data_dir = '%s',sample_list = '%s',x2 = '%s',x2_label = '%s',cpu = %s,use_existing_data = %s,data_type = '%s',class_for_ml = '%s',out_dir = '%s')\""
command <- sprintf(command, args$data_dir,args$sample_list,args$x2,args$x2_label,args$cpu, args$use_existing_data, args$data_type,args$class_for_ml,args$out_dir)
system(command)

It's a little messy but it should work.
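One difference between this wrapper and the original script that might be worth checking is the type of `cpu`. As far as I know, docopt returns option values as character strings, so the original script would pass `cpu = "6"`, while the sprintf template above writes the value unquoted and the generated call gets the number 6. Base R's `parallel::makeCluster()` treats a character spec as host names to connect to and an integer spec as a count of local workers, and ssh reads the host name "6" as the address 0.0.0.6. Whether OmicsEV hands `cpu` to `makeCluster()` is an assumption on my part, not something confirmed in this thread. The sketch below only shows the type difference; `args` is a stand-in for the docopt result and `as_worker_count` is a hypothetical helper.

```r
# Stand-in for the list docopt returns (assumption: option values arrive as
# character strings, so `--cpu 6` becomes "6").
args <- list(cpu = "6")

# Passed straight through, as in the original script, the value is a string.
stopifnot(is.character(args$cpu))

# The sprintf template writes the value unquoted, so the generated R code
# contains the numeric literal 6 instead.
generated <- sprintf("list(cpu = %s)", args$cpu)
from_generated <- eval(parse(text = generated))
stopifnot(is.numeric(from_generated$cpu))

# Hypothetical helper: coerce and validate the worker count up front, so a
# string can never be mistaken for a host name further down.
as_worker_count <- function(x) {
  n <- suppressWarnings(as.integer(x))
  if (length(n) != 1L || is.na(n) || n < 1L) {
    stop("cpu must be a single positive integer, got: ", deparse(x))
  }
  n
}

cpu <- as_worker_count(args$cpu)
print(cpu)
```

If this is the cause, passing `cpu = as.integer(args$cpu)` in the original Rscript would be a smaller change than shelling out to `R -e`.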

toddcreasy commented 1 year ago

Thanks, @iblacksand, your solution did indeed work! I will continue to explore the cause of my Rscript issue but at least I have something in place to continue my work. I've looked at all of the env variables and they are identical so now I've made it my mission to figure this out :)

iblacksand commented 1 year ago

Glad it worked out! Good luck with finding the issue, it seems like it's a tricky one.