ENCODE-DCC / croo

Cromwell output organizer
MIT License

Hello! I am learning to use croo to move data to an output webserver. I have a hopefully easy question #44

Closed methornton closed 1 year ago

methornton commented 1 year ago

Hi! Thank you for making such awesome and powerful software!

So I am testing the ATAC-seq pipeline on a remote server with new data. We also have a web server that we use for serving BAM files from regular bulk RNA-seq, and I was hoping to use it to share the data from the pipeline. I am a biochemist who is new to JSON and barely codes, but I try. I found the example output-definition JSON for ATAC, "atac_croo.v4.json", so to test it I made a local output folder for croo and ran it locally:

$ croo /dataVol/data/17Jan23/ATACSeq/BJ/atac/3ef77992-cd38-4054-b1fd-02003977ff38/metadata.json \
    --out-def-json /dataVol/data/17Jan23/ATACSeq/CROO_check/atac_croo.v4.json \
    --out-dir /dataVol/data/17Jan23/ATACSeq/CROO_check/BJ/ \
    --ucsc-genome-db hg38

This works and produces the HTML files like the example.

My newbie issue is that I want to copy as little as possible to the web server, but if I copy the "out-dir" to the web server, it still contains links to the data on the processing server, which become dead links there.

If I try to copy the processing folder (3ef77992-cd38-4054-b1fd-02003977ff38) to the web server, it is gigantic and takes up all of the room.

I think this is a simpler case than transferring from AWS buckets. I have sudo permissions on both servers. What would be the best way to do this? I would like to share the results and give them the opportunity (if they use it) to look at their data on the UCSC genome browser. The URL for our web server is "https://nep.saban-chla.usc.edu/bamfiles/Initial_Data/ATAC/BJ/"

Any information or assistance is greatly appreciated. Thank you

Matt

leepc12 commented 1 year ago

https://nep.saban-chla.usc.edu/bamfiles/Initial_Data/ATAC/BJ/ is not opening. Can you post your metadata.json file? /dataVol/data/17Jan23/ATACSeq/BJ/atac/3ef77992-cd38-4054-b1fd-02003977ff38/metadata.json

You can make presigned URLs for files on a private bucket to visualize them on UCSC browser.

If you have processed all data on AWS then

# This will read credentials from "aws configure"
$ croo ... --use-presigned-url-s3

If you have processed all data on GCP then

# You need to define GCP JSON key for auth
$ croo ... --use-presigned-url-gcs --gcp-private-key YOUR_GCP_JSON_KEY_PATH
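For context on what these flags produce: a presigned URL is just the normal object URL plus a time-limited signature in the query string, computed locally from your credentials (no network call needed to sign). croo delegates this to the AWS/GCP client libraries; the sketch below hand-rolls the AWS SigV4 variant with only the standard library, purely to illustrate the mechanics. The bucket and key names are hypothetical — use boto3's `generate_presigned_url` for real work.

```python
import datetime
import hashlib
import hmac
import urllib.parse

def presign_s3_url(bucket, key, access_key, secret_key,
                   region="us-east-1", expires=3600):
    """Illustrative SigV4 presigned GET URL for an S3 object.

    Sketch only -- prefer boto3.generate_presigned_url (what croo's
    --use-presigned-url-s3 relies on) in practice.
    """
    host = f"{bucket}.s3.{region}.amazonaws.com"
    now = datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    datestamp = now.strftime("%Y%m%d")
    scope = f"{datestamp}/{region}/s3/aws4_request"
    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires),
        "X-Amz-SignedHeaders": "host",
    }
    query = "&".join(
        f"{urllib.parse.quote(k, safe='')}={urllib.parse.quote(v, safe='')}"
        for k, v in sorted(params.items())
    )
    # Canonical request -> string to sign -> signature, per the SigV4 scheme
    canonical = "\n".join([
        "GET", "/" + urllib.parse.quote(key), query,
        f"host:{host}\n", "host", "UNSIGNED-PAYLOAD",
    ])
    to_sign = "\n".join([
        "AWS4-HMAC-SHA256", amz_date, scope,
        hashlib.sha256(canonical.encode()).hexdigest(),
    ])
    def _hmac(k, msg):
        return hmac.new(k, msg.encode(), hashlib.sha256).digest()
    signing_key = _hmac(_hmac(_hmac(_hmac(
        ("AWS4" + secret_key).encode(), datestamp), region), "s3"), "aws4_request")
    signature = hmac.new(signing_key, to_sign.encode(), hashlib.sha256).hexdigest()
    return f"https://{host}/{urllib.parse.quote(key)}?{query}&X-Amz-Signature={signature}"
```

Anyone holding such a URL can fetch the object until it expires, which is why this works for the UCSC browser without making the bucket public.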
methornton commented 1 year ago

Hi thank you!

So I didn't process the data on a cloud platform; I just used Docker/Cromwell to run the scripts on a processing computer. It takes a day or so on the processing computer. I could port over to the USC CARC, but we don't have to (yet).

The https://nep.saban-chla.usc.edu server is down currently. I need to harden it more. It may be better to share with AWS too.

Here is my metadata.json file metadata.txt

Thank you for your very kind assistance!

Sincerely,

Matt

methornton commented 1 year ago

The server is back up now

leepc12 commented 1 year ago

To visualize outputs from a local computer, you need to set up a web server and host those outputs there. Your outputs will be public on the internet, so use the following method only if your data can go public. Otherwise, use a local genome visualizer.

Croo can map a local directory to an HTTP URL directory. Define the directory mapping in a 2-column TSV file and pass that file like the following:

# from croo help
  --tsv-mapping-path-to-url TSV_MAPPING_PATH_TO_URL
                        A 2-column TSV file with local path prefix and corresponding URL
                        prefix. For example, using 1-line 2-col TSV file with
                        /var/www[TAB]http://my.server.com will replace a local path
                        /var/www/here/a.txt to a URL http://my.server.com/here/a.txt.

# actual command
$ croo your_metadata.json ... --tsv-mapping-path-to-url YOUR_TSV_FILE
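The substitution croo applies with that TSV is a plain prefix replacement. A minimal Python sketch of the same rule, using the /var/www example from the help text above (the function name is mine, not croo's internal API):

```python
import csv

def map_path_to_url(local_path, tsv_path):
    """Replace a local path prefix with its URL prefix, following the
    2-column TSV convention (path_prefix <TAB> url_prefix).

    Returns the path unchanged if no prefix matches.
    """
    with open(tsv_path) as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            prefix, url_prefix = row[0], row[1]
            if local_path.startswith(prefix):
                return url_prefix + local_path[len(prefix):]
    return local_path
```

So a TSV line `/var/www<TAB>http://my.server.com` turns `/var/www/here/a.txt` into `http://my.server.com/here/a.txt`, which is why the files must actually be served from under the mapped local prefix for the generated links to resolve.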
methornton commented 1 year ago

OK. So I still have to copy the entire pipeline output folder ("3ef77992-cd38-4054-b1fd-02003977ff38" in the above example) to the web server (/var/www/here/a.txt), and then use croo to reformat the output into the web-accessible folder? I was hoping not to have to do that. Is there a way to use croo to just extract the results and the files for visualizing on the UCSC genome browser and then copy them to the web server?

methornton commented 1 year ago

Hello! So I made a 500 GB tar file and copied it to the web server. I had to change some of the paths in the output metadata.json with vim, then I ran croo with:

$ croo /data/met/atac_data/Deardorf/02Feb23/fa843551-20a2-4160-af6f-a31516820271/metadata.json \
    --method copy \
    --out-def-json /data/met/atac_data/Deardorf/02Feb23/atac_croo.v4.json \
    --out-dir /data/met/www/nep.saban-chla.usc.edu/bamfiles/Deardorff/Initial_Data/ATAC/HEK293/ \
    --tsv-mapping-path-to-url /data/met/atac_data/Deardorf/02Feb23/paths.tsv \
    --ucsc-genome-db hg38

croo ran without error, but there is no flowchart (task graphs) or custom track files for loading into the UCSC genome browser (track hub). Can you help me find out what went wrong? Will this even work with '--method copy'?

You can look here at the output

https://nep.saban-chla.usc.edu/bamfiles/Deardorff/Initial_Data/ATAC/HEK293/

I would like to try to get as complete output as possible before we start generating data. I really appreciate any help that you can give. Thank you!

methornton commented 1 year ago

Hello!

I found that if I put the atac_croo.v4.json in the folder with the metadata JSON, it will make the UCSC genome browser tracks. I am still not getting the task graphs. Do I need to set a "basername"? The genome browser tracks are the most necessary, but it would be nice to make the task graph too.

Here is the additional test data, https://nep.saban-chla.usc.edu/bamfiles/Deardorff/Initial_Data/ATAC/BJ/

leepc12 commented 1 year ago

Did you see any error complaining about graphviz? Make sure that your system has the dot executable (graphviz's binary) installed:

$ dot -V
dot - graphviz version 2.43.0 (0)

Please install it and run the croo command again:

$ sudo apt-get install graphviz
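A quick way to confirm the prerequisite before re-running croo is to check for `dot` programmatically. A small Python sketch (the helper name is mine); note that in this thread croo ran without an error and simply produced no graph when dot was missing:

```python
import shutil
import subprocess

def dot_version():
    """Return graphviz's `dot` version string if it is on PATH, else None."""
    exe = shutil.which("dot")
    if exe is None:
        return None
    result = subprocess.run([exe, "-V"], capture_output=True, text=True)
    # `dot -V` prints its version banner to stderr, not stdout
    return (result.stderr or result.stdout).strip()
```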
methornton commented 1 year ago

That was it! I installed graphviz and the task graphs were drawn. So, to summarize: I had to copy the files to a new location, change the paths in the metadata.json in the processing folder, and keep the atac_croo.v4.json output-definition file and the paths.tsv file in the folder with the metadata.json to correctly make the UCSC genome browser tracks. Thank you so much for your help!