broadinstitute / seqr

web-based analysis tool for rare disease genomics
GNU Affero General Public License v3.0

Upload BAM from local files #1194

Closed dmcgoldrick closed 4 years ago

dmcgoldrick commented 4 years ago

Hello --

First, thanks for making seqr available on GitHub! We are trying out the application.

I made a TSV file for some BAMs and uploaded it through the seqr interface, navigating from Project Page -> Edit Datasets -> Add BAM/CRAM Paths.

Using the Tab Separated File (tsv) option, seqr accepts the file "id2bam.tsv" and reports

"Parsed 1 rows from id2bam.tsv" ...

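I tried four forms of the two-column row (ID, tab, path): 1) a symlinked path, 2) the symlinked path with a file:// prefix, 3) the actual path, and 4) the root path with a file:// prefix.

The errors are:

1) symlinked path and 3) actual path: Error updating 360372: error accessing "/tmp/" (400)

2) symlinked path with file:// and 4) root path with file://: Error updating 360372: incomplete format (400)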

I have not tried uploading from a Google Storage bucket yet. It is probably something really simple, but I cannot get this two-column format (an ID and a BAM path) to upload from the interface :-/. Yes, there is a BAM index there too, so I think we don't have something configured correctly yet...
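A quick way to rule out problems in the TSV itself is to check it outside seqr first. The following is a minimal Python 3 sketch, not seqr's own validation: the function names are hypothetical, and the index naming it looks for (`sample.bam.bai` or `sample.bai`, and the `.crai` equivalents for CRAM) is the common samtools convention, assumed here rather than taken from seqr.

```python
import csv
import os


def index_candidates(file_path):
    """Return the index file names conventionally used for a BAM or CRAM file."""
    base, ext = os.path.splitext(file_path)
    index_ext = {".bam": ".bai", ".cram": ".crai"}.get(ext.lower())
    if index_ext is None:
        return []
    # e.g. sample.bam -> sample.bam.bai and sample.bai
    return [file_path + index_ext, base + index_ext]


def check_id_to_path_tsv(tsv_path):
    """Yield (line_number, message) for each row that looks wrong."""
    with open(tsv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        for line_number, row in enumerate(reader, start=1):
            if not row or all(not cell.strip() for cell in row):
                continue  # ignore blank lines
            if len(row) != 2:
                yield line_number, "expected 2 tab-separated columns, found %d" % len(row)
                continue
            sample_id, file_path = (cell.strip() for cell in row)
            if not os.path.isfile(file_path):
                yield line_number, "%s: %r is not a regular file" % (sample_id, file_path)
                continue
            candidates = index_candidates(file_path)
            if not candidates:
                yield line_number, "%s: %r is not a .bam or .cram file" % (sample_id, file_path)
            elif not any(os.path.isfile(path) for path in candidates):
                yield line_number, "%s: no index file found next to %r" % (sample_id, file_path)
```

Running `for line, message in check_id_to_path_tsv("id2bam.tsv"): print(line, message)` as the same user that runs seqr should print nothing if every row has two columns, a readable path, and an index beside it.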

We also have a local installation of Elasticsearch and would like to use it with seqr. Are there instructions for extracting, transforming, and loading data from a local multi-VCF file or set of VCF files into a local (non-cloud) Elasticsearch instance? We need help with the documentation and instructions on the functionality of seqr.

You can email me directly if this forum is not appropriate. We have a few other questions: Can we get the system for Python 3? How do we upload test data in general? How do we set up cloud vs. local (preferred)? How do we do the Elasticsearch/Hail cloud parts?

thank you!

Daniel J McGoldrick mcgold@uw.edu

hanars commented 4 years ago

Hi Daniel,

I'm glad to hear you are trying out seqr. The format you want for the path is 3), the actual path. The check we do in python for the file is os.path.isfile(file_path) so what I would recommend is you open a python shell and run that command on the path and see if it succeeds. If not, there may be a typo in the path or an issue with your pythonpath setup and where it is looking for files.
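As a rough illustration of that check, here is a small Python sketch that runs `os.path.isfile` along with a few related standard-library tests. The function name `diagnose_path` and the wording of the notes are hypothetical and not part of seqr. The result only means something if it is run as the same user, and on the same machine or container, as the seqr process.

```python
import os


def diagnose_path(file_path):
    """Return a list of notes explaining why file_path might be rejected."""
    notes = []
    if file_path.startswith("file://"):
        notes.append("path has a file:// prefix; use a plain filesystem path")
        return notes
    if os.path.islink(file_path):
        # isfile() follows symlinks, so the link target is what matters
        notes.append("path is a symlink to " + os.path.realpath(file_path))
    if not os.path.exists(file_path):
        notes.append("path does not exist, or a parent directory is not accessible")
    elif not os.path.isfile(file_path):
        notes.append("path exists but is not a regular file")
    elif not os.access(file_path, os.R_OK):
        notes.append("file exists but is not readable by the current user")
    return notes
```

An empty list means the path passes the same `os.path.isfile` test for the user running the script. A note about a missing path on a file that is known to exist usually points at permissions on a parent directory or at a path that is only visible on a different machine.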

The instructions for running the loading pipeline to get your data into your local Elasticsearch can be found here: https://github.com/macarthur-lab/seqr/blob/master/deploy/LOCAL_INSTALL.md#step-5-load-dataset

Re: Python 3, it is on our roadmap and we hope to have that released within the next few months.

Best, Hana Snow

dmcgoldrick commented 4 years ago

Hi Hana --

Thanks for the suggestion and link - this is indeed the step I am testing now :-)

I'll work on this more today and try your suggestions.

Daniel


-- Daniel J McGoldrick Ph.D

UW Genome Sciences Center (GRC),

Center for Mendelian Genomics (CMG)

Box 355065, Seattle, WA 98195, (206) 685-7342

dmcgoldrick commented 4 years ago

Hello Hana

I used the non-sym linked paths and a non-restricted path and the BAM/IGV uploads worked nicely - so you can close this :-)

Also, on the topic of uploading data: I had to edit the Apache Spark properties file to add to the perl5lib path so that the code could see the Perl JSON module. It was not in the Perl install on the Apache Spark cluster that we launched by default. This allowed me to continue testing with:

python2.7 gcloud_dataproc/submit.py --run-locally hail_scripts/v01/load_dataset_to_es.py \
  --spark-home $SPARK_HOME \
  --genome-version $GENOME_VERSION \
  --project-guid $PROJECT_GUID \
  --sample-type $SAMPLE_TYPE \
  --dataset-type $DATASET_TYPE \
  --skip-validation \
  --exclude-hgmd \
  --vep-block-size 100 \
  --es-block-size 10 \
  --num-shards 1 \
  --hail-version 0.1 \
  --use-nested-objects-for-vep \
  --use-nested-objects-for-genotypes \
  $INPUT_VCF
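Because the command depends on six shell variables, an unset one silently shifts the arguments. The sketch below assembles the same argument list in Python and refuses to build it if a variable is missing. The function name `build_load_command` and the `REQUIRED_VARS` tuple are hypothetical; the flags and their values are copied from the command above, not from the pipeline's documentation.

```python
REQUIRED_VARS = (
    "SPARK_HOME",
    "GENOME_VERSION",
    "PROJECT_GUID",
    "SAMPLE_TYPE",
    "DATASET_TYPE",
    "INPUT_VCF",
)


def build_load_command(env):
    """Build the argv list for the load command from a mapping of variables."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("unset variables: " + ", ".join(missing))
    return [
        "python2.7", "gcloud_dataproc/submit.py", "--run-locally",
        "hail_scripts/v01/load_dataset_to_es.py",
        "--spark-home", env["SPARK_HOME"],
        "--genome-version", env["GENOME_VERSION"],
        "--project-guid", env["PROJECT_GUID"],
        "--sample-type", env["SAMPLE_TYPE"],
        "--dataset-type", env["DATASET_TYPE"],
        "--skip-validation",
        "--exclude-hgmd",
        "--vep-block-size", "100",
        "--es-block-size", "10",
        "--num-shards", "1",
        "--hail-version", "0.1",
        "--use-nested-objects-for-vep",
        "--use-nested-objects-for-genotypes",
        env["INPUT_VCF"],
    ]
```

It could be launched with `subprocess.check_call(build_load_command(os.environ))` from the directory the original command was run in; passing a list rather than a shell string also avoids quoting problems if the VCF path contains spaces.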

I think it is working... just FYI about the perl module/spark in our hands

hanars commented 4 years ago

Glad this is working now, and thanks for the FYI! If you reply with the modification you made to the config file and which OS you are using, I can update our install scripts so others won't run into the same issue.

dmcgoldrick commented 4 years ago

Hi Hana --

Here is what I did to get the VEP Apache Spark cluster to see our perl5lib, with the JSON module installed, during local testing. The change is to the hail.vep.perl5lib line.

The file is /vep/vep-gcloud-grch37.properties:

hail.vep.perl = /usr/local/bin/perl
hail.vep.perl5lib = /vep/loftee:/usr/share/perl5/vendor_perl:/home/nick-seqr/perl5/lib/perl5
hail.vep.location = /vep/variant_effect_predictor/variant_effect_predictor.pl
hail.vep.cache_dir = /vep
hail.vep.lof.human_ancestor = /vep/loftee_data_grch37/loftee_data/human_ancestor.fa.gz
hail.vep.lof.conservation_file = /vep/loftee_data_grch37/loftee_data/phylocsf.sql
hail.vep.fasta = /vep/homo_sapiens/85_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa
hail.vep.assembly = GRCh37
hail.vep.plugin = LoF,human_ancestor_fa:/vep/loftee_data_grch37/loftee_data/human_ancestor.fa.gz,filter_position:0.05,min_intron_size:15,run_splice_predictions:0,conservation_file:/vep/loftee_data_grch37/loftee_data/phylocsf.sql
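To confirm which perl5lib entry actually supplies the JSON module, the properties file can be read and each directory checked for `JSON.pm`. This is a minimal sketch under stated assumptions: the function names are hypothetical, the parser only handles simple `key = value` lines (not the full Java properties syntax), and it assumes the module is installed as a top-level `JSON.pm` in one of the listed directories.

```python
import os


def read_properties(path):
    """Parse simple 'key = value' lines, skipping blanks and # comments."""
    props = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    return props


def dirs_providing_module(perl5lib, module_file="JSON.pm"):
    """Return the colon-separated perl5lib entries that contain module_file."""
    return [
        directory
        for directory in perl5lib.split(":")
        if directory and os.path.isfile(os.path.join(directory, module_file))
    ]
```

For example, `dirs_providing_module(read_properties("/vep/vep-gcloud-grch37.properties")["hail.vep.perl5lib"])` returns the entries that contain the module; an empty list would suggest the path still needs extending on that node.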

best, Daniel


dmcgoldrick commented 4 years ago

oh -- our operating system is CentOS 7

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
