Closed HadleyKing closed 1 month ago
On Mon, 22 Apr 2024 10:43:44 -0400, aklwong@gwu.edu wrote:
Forwarding Raja's email to our RT ticket system for the record and for future follow-up.
Adam K. L. Wong, PhD. High Performance Computing Specialist for Genomics Research Technology Services, GWIT The George Washington University Email: aklwong@gwu.edu
---------- Forwarded message ---------
From: Raja Mazumder [mazumder@gwu.edu](mailto:mazumder@gwu.edu)
Date: Fri, Apr 19, 2024 at 5:09 PM
Subject: Re: NCBI NT
To: Charles Hadley King [hadley_king@email.gwu.edu](mailto:hadley_king@email.gwu.edu)
Cc: Kai Leung Wong [aklwong@gwu.edu](mailto:aklwong@gwu.edu), Jonathon Keeney [keeneyjg@gwu.edu](mailto:keeneyjg@gwu.edu)
Hi Adam, Hope you are doing well. Once we have the index we will move it out of Pegasus and delete the file. We need the indexed file to run some blasts and compare it with a smaller version of NT (80GB) that we have prepared. This is for a paper we are working on. Many thanks, Raja
On Fri, Apr 19, 2024, 5:03 PM Charles Hadley King [hadley_king@email.gwu.edu](mailto:hadley_king@email.gwu.edu) wrote:
Adam, I would like to try an experiment on Pegasus with NCBI's NT db, which was about 1.2 TB the last time I checked. We have a filtered version of it, about 1.1 TB, that I would need to transfer from our HIVE resources to Pegasus. I would like to use NCBI BLAST to create an index for the NT db. We think this process may take a week or so. I wanted to let someone know before I start it so that the job does not get killed.
Is there anything else I should do to protect this job once it is started?
Thank you,
Charles Hadley King, M.S. Research Scientist, HIVE Lab BioCompute Technical Lead The George Washington University Ross Hall 2300 Eye Street N.W. Washington, DC 20037 Mobile: 610-613-3063 hadley_king@gwu.edu https://orcid.org/0000-0003-1409-4549
Hi Raja and Charles,
Thank you for contacting us about using Pegasus to run your NCBI BLAST computation! I have created this ticket for the record and for future follow-up.
@Charles, if you are going to work with huge data, please make sure to select the "defq" partition, which allows jobs to run for up to 14 days.
Best, Adam
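A minimal sketch of what a batch script for the defq partition might look like. The job name, log file name, and input FASTA name are assumptions for illustration, and the 14-day time request simply follows Adam's note above. The script is only written to disk here, not submitted.

```shell
#!/bin/sh
# Write a Slurm batch script for a long makeblastdb run (sketch only).
# Assumptions: job name, log file name, and input FASTA name are hypothetical.
cat > index_nt.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=index_nt
#SBATCH --partition=defq
#SBATCH --time=14-00:00:00
#SBATCH --nodes=1
#SBATCH --output=index_nt_%j.log

makeblastdb -in filteredNT.fasta -dbtype nucl -parse_seqids
EOF

# Submit only where Slurm is available (e.g. on a Pegasus login node):
# sbatch index_nt.sbatch
echo "wrote index_nt.sbatch"
```

After submitting, `squeue -u $USER` and `scontrol show job <jobid>` can be used to confirm the partition and time limit the job actually received.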
New task for @penningtonea
From Joe:
Will use slimNT in place of filtered-NT for the time being for comparison of FASTA to BLAST. slimNT is smaller but is already on GW HPC.
Next step for NT project:
Index NT with BLAST: find the documentation for BLAST indexing online and submit the indexing as a Slurm job.
Figure out whether BLAST is installed on Pegasus.
Documentation for Slurm jobs is on the GW website. Review the GitHub issue and email for all information about the job.
Build a test DB of 10 or so FASTA files and move it to Pegasus. Run a test index with the 10-FASTA DB and see how long it takes to run and fully index.
Scale up to 10, 100, and 1000 and see how long each takes. Note the size of the DB (e.g. filtered NT is a 1 TB list of FASTA files).
Move NT over to HIVE, perform several comparisons, and present the results to Merck.
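One possible way to do the scale-up test, sketched under assumptions: `take_fasta_records` is a hypothetical helper, `source.fasta` is a placeholder input name, and the timing loop only runs where makeblastdb is installed and the input exists.

```shell
# take_fasta_records N FILE: print the first N FASTA records of FILE.
take_fasta_records() {
    awk -v n="$1" '/^>/ { if (++count > n) exit } { print }' "$2"
}

# Build test inputs of increasing size and time makeblastdb on each.
# source.fasta is a hypothetical input file name.
if command -v makeblastdb >/dev/null 2>&1 && [ -f source.fasta ]; then
    for n in 10 100 1000; do
        take_fasta_records "$n" source.fasta > "test_${n}.fasta"
        ls -l "test_${n}.fasta"   # note the input size for each run
        time makeblastdb -in "test_${n}.fasta" -dbtype nucl -parse_seqids
    done
fi
```

Recording the input size next to each timing should make it easier to estimate how long the full 1 TB run might take, although indexing time may not scale linearly.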
Blockers:
BLAST command to make database: makeblastdb -in mydb.fsa -dbtype nucl -parse_seqids
>BT006946.1 Homo sapiens cytochrome c, somatic mRNA, complete cds
ATGGGTGATGTTGAGAAAGGCAAGAAGATTTTTATTATGAAGTGTTCCCAGTGCCACACCGTTGAAAAGG
GAGGCAAGCACAAGACTGGGCCAAATCTCCATGGTCTCTTTGGGCGGAAGACAGGTCAGGCCCCTGGATA
CTCTTACACAGCCGCCAATAAGAACAAAGGCATCATCTGGGGAGAGGATACACTGATGGAGTATTTGGAG
AATCCCAAGAAGTACATCCCTGGAACAAAAATGATCTTTGTCGGCATTAAGAAGAAGGAAGAAAGGGCAG
ACTTAATAGCTTATCTCAAAAAAGCTACTAATGAGTAG
>M20622.1 Rat somatic cytochrome c mRNA
ATGGGTGATGTTGAAAAAGGCAAGAAGATTTTTGTTCAAAAGTGTGCCCAGTGCCACACTGTGGAAAAAG
GAGGCAAGCATAAGACTGGACCAAACCTCCATGGTCTGTTTGGGCGGAAGACAGGCCAGGCTGCTGGATT
CTCTTACACAGATGCCAACAAGAACAAAGGTATCACCTGGGGAGAGGATACCCTGATGGAGTATTTGGAA
AATCCCAAAAAGTACATCCCTGGAACAAAAATGATCTTCGCTGGAATTAAGAAGAAGGGAGAAAGGGCAG
ACCTAATAGCTTATCTTAAAAAGGCTACTAATGAATAA
>AK088098.1 Mus musculus 2 days neonate thymus thymic cells cDNA, RIKEN full-length enriched library, clone:E430004C08 product:cytochrome c, somatic, full insert sequence
GAGAGCGCGGGACGTCTGTCTTCGAGTCCGAACGTTCGTGGTGTTGACCAGCCCGGAACGAATTAAAAAT
GGGTGATGTTGAAAAAGGCAAGAAGATTTTTGTTCAGAAGTGTGCCCAGTGCCACACTGTGGAAAAGGGA
GGCAAGCATAAGACTGGACCAAATCTCCACGGTCTGTTCGGGCGGAAGACAGGCCAGGCTGCTGGATTCT
CTTACACAGATGCCAACAAGAACAAAGGCATCACCTGGGGAGAGGATACCCTGATGGAGTATTTGGAGAA
TCCCAAAAAGTACATCCCTGGAACAAAAATGATCTTCGCTGGAATTAAGAAGGAGGGAGAAAGGGCAGAC
CTAATAGCTTATCTTAAAAAGGCTACTAATGAGTAATTCCACTGCCTTATTTATTACAAAACAAATGTCT
CATGGCTTTTAATGTACACCATAATTTAATTCACACACCAAATTCAGATCATGAATGGCTAGCAATGTTT
TTGTTGGACAGTCCTGATTTAAGTAAAACTGACTTGTCATAAAGTGGGTACGGTCTTTATTAAAGCAACA
GTTCCAGTTGTATACATGCTACCACGGCTCTCCCTTTCTCAAGATAAGATTGGACTTAATTAGCAATGTT
TTACTTTCCATAAATAGGGGCATGTCACCTCAAACCTACTAAATGGTTTTATACTTAGATTTATATAACT
GGGCATATGAATATGCTTAAACACTGGGAAAATTCTATCACTGTCTCAGAAACAAGAAGACTCAAATGTG
TTTCAGTTGTGTTCACTGGCCTCTTTCAGGTCATGGCTAACCACCAGGAGGCAACTGTCTATTCTTGACA
GTGCATTTTTAATTAGAATGTCTACATCAAGGATGTTGCCTTTACTATTGAAAGGCATTTACTTTTTTTT
TTGTATGATATCAAATAAAGAGTATTTAACACTTTTT
Next task:
It looks like the makeblastdb command works on Keeney's computer, so @penningtonea, please run the command via Slurm. Ask Hadley or Keeney for help if you get stuck. For the BLAST search, use blastn -db databaseName -query queryFileName. Also check whether there is an output option you can use (blastn has -out).
Edit: spelling
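A small sketch of the search step, in case it helps. `run_blastn` is a hypothetical wrapper, and the tabular output format (`-outfmt 6`) is an assumption, not something decided in this thread. The wrapper prints the command and only runs it where blastn is installed.

```shell
# run_blastn DB QUERY OUT: search QUERY against DB, writing tabular hits to OUT.
# Prints the command first; runs it only if blastn is installed.
run_blastn() {
    echo "blastn -db $1 -query $2 -out $3 -outfmt 6"
    if command -v blastn >/dev/null 2>&1; then
        blastn -db "$1" -query "$2" -out "$3" -outfmt 6
    fi
}
```

Example call, with placeholder names: `run_blastn databaseName queryFileName hits.tsv`.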
Note for SCP transferring:
You have to pull from hive-login; you cannot push from Pegasus. Make sure your public key is added in the correct place, then pull the file, e.g.:
scp pegasus.arc.gwu.edu:/SMHS/home/keeneyjg/sample.txt .
Progress update:
The script did not work. Received the same error message from the call earlier - "Error: Too many positional arguments (1), the offending value: –in Error: (CArgException::eSynopsis) Too many positional arguments (1), the offending value: –in". See screenshots below for additional information.
scontrol show job output:
output file with error message:
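The offending value in the error message, `–in`, appears to start with an en dash rather than an ASCII hyphen. This usually happens when a command is copied from a word processor or web page, and makeblastdb then reads it as a positional argument instead of the -in option. A sketch for normalizing such dashes in a script before submitting it (the function name `fix_dashes` is hypothetical):

```shell
# fix_dashes: read text on stdin, replace en dashes (U+2013) and
# em dashes (U+2014) with ASCII hyphens, and write the result to stdout.
fix_dashes() {
    endash=$(printf '\342\200\223')   # UTF-8 bytes for U+2013
    emdash=$(printf '\342\200\224')   # UTF-8 bytes for U+2014
    sed -e "s/${endash}/-/g" -e "s/${emdash}/-/g"
}
```

For example, `fix_dashes < job.sh > job_fixed.sh` would rewrite a script (placeholder file names) so that every option starts with a plain hyphen.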
QUESTION: How can I change my account from Account=watkinslab and my GroupID from MG-watkinslab(1111) to our group? I no longer need access to the watkinslab account.
Downloading the preformatted BLAST database will remove the need to index NT on Pegasus. Log in to HIVE API and run the following commands, received from NLM, to download the preformatted NT:
1) Install standalone BLAST+.
2) Use the update_blastdb.pl command to download and extract all files needed for nt:
perl update_blastdb.pl --decompress nt
Use perl update_blastdb.pl --help to see available switches.
3) Run BLAST with 1 query sequence to test.
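Since NT was quoted earlier in this thread at roughly 1.2 TB, it may be worth checking free space before starting the download. A sketch, where `has_free_kb` is a hypothetical helper and the 2 TB threshold is an assumed safety margin, not a figure from NCBI:

```shell
# has_free_kb DIR NEEDED_KB: succeed if DIR's filesystem has at least
# NEEDED_KB kilobytes available (POSIX df output, 1024-byte blocks).
has_free_kb() {
    avail=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail" -ge "$2" ]
}

# Hypothetical threshold: 2 TB expressed in KB.
needed_kb=$((2 * 1024 * 1024 * 1024))
if has_free_kb . "$needed_kb"; then
    echo "enough space; could run: perl update_blastdb.pl --decompress nt"
else
    echo "not enough free space in $(pwd) for the assumed threshold"
fi
```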
@penningtonea you still should go ahead and index filtered-NT on pegasus in parallel. It will be good training for you and also I am not 100% sure the formatted-NT will work.
Instructions to download indexed NT from NCBI:
Hi,
Thanks for writing to us.
Make sure you have enough disk space. After installing standalone blast+, you can use update_blastdb.pl to download and extract all files needed for nt:
perl update_blastdb.pl --decompress nt
Do perl update_blastdb.pl --help to see available switches.
Regards,
Documenting email response: "Do not touch those files through further manual manipulation - they are ready to use as they are. Those files are tied together by an alias file (nt.nal) and you simply call the database with -db nt if you have BLASTDB variable set to point to the directory containing all those files.
If you do not, you will need to prefix the nt with direct path."
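The rule in that reply can be sketched as a small helper. `resolve_nt_db` is a hypothetical name, and it assumes BLASTDB holds a single directory (a simplification for illustration).

```shell
# resolve_nt_db DIR: print the value to pass to -db for a preformatted nt
# database stored in DIR. If BLASTDB already points at DIR, "nt" is enough;
# otherwise the database name is prefixed with the directory path.
resolve_nt_db() {
    if [ "${BLASTDB:-}" = "$1" ]; then
        echo "nt"
    else
        echo "$1/nt"
    fi
}
```

Example, with a placeholder path: `blastn -db "$(resolve_nt_db /path/to/blastdb)" -query queryFileName -out hits.txt`.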
Was able to execute makeblastdb on Pegasus. Output:
Tested 1 sequence with manually inserted deletions against the 3-sequence db. Output:
@penningtonea you still should go ahead and index filtered-NT on pegasus in parallel. It will be good training for you and also I am not 100% sure the formatted-NT will work.
Working on downloading and indexing filtered-NT. Afterwards will scp the db to HIVE3.
Mix-up with the task. Will download NT on Pegasus today. Email sent to Adam Wong asking about the best way to index NT via the Slurm scheduler, given the size of the job.
NT was downloaded and indexed on Pegasus. Directory: SMHS/home/epennington/lustre/groups/hivelab/emily/NT
Now downloading and indexing filtered-NT v7.0 on Pegasus.
@penningtonea This is what I see. I am on the login node.
hadley_king@pegasus.arc.gwu.edu:/SMHS/groups/hivelab/filteredNT_v7.0/filteredNT.fasta
GW HPC Pegasus is not working. I am able to log in but unable to submit or allocate jobs. Indexing filtered_nt will resume when Pegasus is back up and running as usual.
I am moving this closed ticket to the July task list to compare it with a similar ticket.
@HadleyKing and @penningtonea I was trying to follow this ticket. Can you tell me which ticket this was moved to?
@rajamazumder This ticket was tied to https://github.com/GW-HIVE/Platform/issues/93
As of 2024-10-31 this is complete.
FilteredNT_v8.0 has been created (Oct 29 22:52), indexed (Oct 30 21:12), and tested (Oct 30 23:06).
Filtered NT: pegasus.arc.gwu.edu:/scratch/hivelab//blastdb/filteredNT_8.0/filteredNT_v8.0.fasta
Indexed Filtered NT: pegasus.arc.gwu.edu:/scratch/hivelab//blastdb/filteredNT_8.0/indexed_filteredNT_v8.0
Apr 19, 2024, 5:09 PM
We would like to try an experiment on Pegasus with NCBI's NT db, which was about 1.2 TB the last time I checked. We have a filtered version of it, about 1.1 TB, that I would need to transfer from our HIVE resources to Pegasus. I would like to use NCBI BLAST to create an index for the NT db. We think this process may take a week or so. I wanted to let someone know before I start it so that the job does not get killed.
Once we have the index we will move it out of Pegasus and delete the file. We need the indexed file to run some blasts and compare it with a smaller version of NT (80GB) that we have prepared. This is for a paper we are working on.