Illumina / ExpansionHunter

A tool for estimating repeat sizes

Scaling up runs into throttling/blocks from www.ebi.ac.uk #174

Open moskalenko opened 1 year ago

moskalenko commented 1 year ago

Hi. I have a client who's trying to run a few hundred ExpansionHunter analyses at the same time. Unfortunately, all ExpansionHunter analyses beyond a relatively small set stop because of hung HTTP requests to www.ebi.ac.uk. Stracing a single test job showed the requests below. The user is using a local GRCh38_full_analysis_set_plus_decoy_hla.fa reference file, but its path is different from the one that comes up in the strace, which is confusing. We're not sure where in ExpansionHunter the attempt to read "/gpfs/internal/sweng/production/Resources/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa" is coming from, but the www.ebi.ac.uk hit seems to be made by EH because of a missing reference. If there's a known workaround for preventing a storm of outgoing requests to www.ebi.ac.uk, please let me know. I'd be happy to host whatever reference data is needed locally. Alternatively, a way to force ExpansionHunter to skip the ids with no local reference would work, too.

Thanks,

Alex

stat("/gpfs/internal/sweng/production/Resources/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa", 0x7ffe7a067ff0) = -1 ENOENT (No such file or directory)
write(2, "Failed to populate reference for id 2387\n", 41) = 41
stat("/home/jdoe/.cache/hts-ref/88/49/c9f185b5ae8ed6d60d3b99c6591c", 0x7ffe7a06c120) = -1 ENOENT (No such file or directory)
stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=93, ...}) = 0
open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 2802
fstat(2802, {st_mode=S_IFREG|0644, st_size=329, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b73ce1aa000
read(2802, "# HEADER: This file was autogenerated at 2022-12-22 06:48:12 -0500\n# HEADER: by puppet. While it can still be managed manually, it\n# HEADER: is definitely not recommended.\n127.0.0.1\tlocalhost.localdo"..., 4096) = 329
read(2802, "", 4096) = 0
close(2802) = 0
munmap(0x2b73ce1aa000, 4096) = 0
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 2802
setsockopt(2802, SOL_IP, IP_RECVERR, [1], 4) = 0
connect(2802, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("172.16.207.246")}, 16) = 0
poll([{fd=2802, events=POLLOUT}], 1, 0) = 1 ([{fd=2802, revents=POLLOUT}])
sendmmsg(2802, [{msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\325\1\0\0\1\0\0\0\0\0\0\3www\3ebi\2ac\2uk\0\0\1\0\1", iov_len=31}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=31}, {msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="h\334\1\0\0\1\0\0\0\0\0\0\3www\3ebi\2ac\2uk\0\0\34\0\1", iov_len=31}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=31}], 2, MSG_NOSIGNAL) = 2
poll([{fd=2802, events=POLLIN}], 1, 5000) = 1 ([{fd=2802, revents=POLLIN}])
ioctl(2802, FIONREAD, [136]) = 0
recvfrom(2802, "h\334\201\200\0\1\0\1\0\1\0\0\3www\3ebi\2ac\2uk\0\0\34\0\1\300\f\0\5\0\1\0\0\0\3\0\10\3www\1g\300\20\300/\0\6\0\1\0\0\1\4\0I\7ns-1300\tawsdns-34\3org\0\21awsdns-hostmaster\6amazon\3com\0\0\0\0\1\0\0\34 \0\0\3\204\0\22u\0\0\1Q\200", 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("172.16.207.246")}, [28->16]) = 136
poll([{fd=2802, events=POLLIN}], 1, 4999) = 1 ([{fd=2802, revents=POLLIN}])
ioctl(2802, FIONREAD, [381]) = 0
recvfrom(2802, "\325\201\200\0\1\0\2\0\4\0\10\3www\3ebi\2ac\2uk\0\0\1\0\1\300\f\0\5\0\1\0\0\0\3\0\10\3www\1g\300\20\300+\0\1\0\1\0\0\0D\0\4\301>\301P\300/\0\2\0\1\0\0\36\236\0\27\7ns-1300\tawsdns-34\3org\0\300/\0\2\0\1\0\0\36\236\0\26\6ns-434\tawsdns-54\3com\0\300/\0\2\0\1\0\0\36\236\0\27\7ns-1592\tawsdns-07\2co\300\27\300/\0\2\0\1\0\0\36\236\0\26\6ns-953\tawsdns-55"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("172.16.207.246")}, [28->16]) = 381
close(2802) = 0
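
For readers following the trace: the second stat() looks like a lookup in a local reference cache keyed by the sequence's MD5 digest, with the first two pairs of hex characters used as directory levels. The sketch below reproduces that layout from the path seen in the trace. It assumes a 2/2/rest split (inferred from the trace, which matches what I understand HTSlib's default cache layout to be), and the helper name hts_ref_cache_path is hypothetical, not part of ExpansionHunter or HTSlib.

```python
from pathlib import PurePosixPath


def hts_ref_cache_path(md5_hex: str, cache_root: str) -> PurePosixPath:
    """Map an MD5 hex digest to a cache path laid out as root/aa/bb/rest.

    Assumption: the 2/2/rest split is inferred from the strace output
    above; this is an illustration, not HTSlib's own code.
    """
    md5_hex = md5_hex.lower()
    if len(md5_hex) != 32 or any(c not in "0123456789abcdef" for c in md5_hex):
        raise ValueError("expected a 32-character hexadecimal MD5 digest")
    return PurePosixPath(cache_root) / md5_hex[:2] / md5_hex[2:4] / md5_hex[4:]


# Digest reassembled from the path components in the trace (88 / 49 / c9f1...).
digest = "8849c9f185b5ae8ed6d60d3b99c6591c"
path = hts_ref_cache_path(digest, "/home/jdoe/.cache/hts-ref")
print(path)
```

If that reading is right, the DNS query for www.ebi.ac.uk that follows is the fallback after both the header-specified FASTA path and the local cache entry were missing.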

dennishendriksen commented 1 year ago

Hello @moskalenko, could it be that ExpansionHunter uses samtools under the hood and you are running into the same issue as https://github.com/HKU-BAL/Clair3/issues/180? In that case setting the environment variable REF_PATH=: might prevent the requests. Context: https://www.htslib.org/doc/samtools.html#REFERENCE_SEQUENCES
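
For batch runs, the REF_PATH=: suggestion can be applied per job by launching each analysis with a modified environment. The following is a minimal sketch, not a tested fix: the helper names env_without_remote_refs and run_offline are hypothetical, and the child command is a stand-in that only echoes the variable. Whether this fully suppresses the lookups for a given ExpansionHunter build is an assumption that needs checking, for example with strace as above.

```python
import os
import subprocess
import sys


def env_without_remote_refs(base_env=None):
    """Return a copy of the environment with REF_PATH set to ':'.

    Assumption: as suggested in this thread, REF_PATH=':' keeps HTSlib
    from falling back to its default remote reference lookup.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["REF_PATH"] = ":"
    return env


def run_offline(cmd):
    """Run cmd (a list of arguments) with the modified environment."""
    return subprocess.run(
        cmd,
        env=env_without_remote_refs(),
        check=True,
        capture_output=True,
        text=True,
    )


# Stand-in child process: prints the REF_PATH value it actually received.
result = run_offline(
    [sys.executable, "-c", "import os; print(os.environ['REF_PATH'])"]
)
print(result.stdout.strip())
```

In a real pipeline the stand-in command would be replaced by the actual ExpansionHunter command line. Setting the variable in the job script or scheduler environment achieves the same thing without a wrapper.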

moskalenko commented 1 year ago

You are right! samtools was not available in the ExpansionHunter environment. I've added it and will ask the user to run a test.