Index FilteredNT on Pegasus for BLAST

HadleyKing commented 6 months ago

Apr 19, 2024, 5:09 PM

We would like to try an experiment on Pegasus with NCBI's NT db, which was about 1.2 TB when I last checked. We have a filtered version of about 1.1 TB that I would need to transfer from our HIVE resources to Pegasus. I would like to use NCBI BLAST to create an index for the NT db. We think this process may take a week or so. I wanted to let someone know before I start it so that the job will not get killed.

Once we have the index we will move it out of Pegasus and delete the file. We need the indexed file to run some BLAST searches and compare the results with a smaller version of NT (80 GB) that we have prepared. This is for a paper we are working on.

HadleyKing commented 6 months ago

On Mon, 22 Apr 2024 10:43:44 -0400, aklwong@gwu.edu wrote:

Forwarding Raja's email to our RT ticket system for a record and the future follow up.

Adam K. L. Wong, PhD. High Performance Computing Specialist for Genomics Research Technology Services, GWIT The George Washington University Email: aklwong@gwu.edu

---------- Forwarded message ---------
From: Raja Mazumder [mazumder@gwu.edu](mailto:mazumder@gwu.edu)
Date: Fri, Apr 19, 2024 at 5:09 PM
Subject: Re: NCBI NT
To: Charles Hadley King [hadley_king@email.gwu.edu](mailto:hadley_king@email.gwu.edu)
Cc: Kai Leung Wong [aklwong@gwu.edu](mailto:aklwong@gwu.edu), Jonathon Keeney [keeneyjg@gwu.edu](mailto:keeneyjg@gwu.edu)

Hi Adam, Hope you are doing well. Once we have the index we will move it out of Pegasus and delete the file. We need the indexed file to run some BLAST searches and compare the results with a smaller version of NT (80 GB) that we have prepared. This is for a paper we are working on. Many thanks, Raja

On Fri, Apr 19, 2024, 5:03 PM Charles Hadley King [hadley_king@email.gwu.edu](mailto:hadley_king@email.gwu.edu) wrote:

Adam, I would like to try an experiment on Pegasus with NCBI's NT db, which was about 1.2 TB when I last checked. We have a filtered version of about 1.1 TB that I would need to transfer from our HIVE resources to Pegasus. I would like to use NCBI BLAST to create an index for the NT db. We think this process may take a week or so. I wanted to let someone know before I start it so that the job will not get killed.

Is there anything else I should do to protect this job once it is started?

Thank you,

Charles Hadley King, M.S. Research Scientist, HIVE Lab BioCompute Technical Lead The George Washington University Ross Hall 2300 Eye Street N.W. Washington, DC 20037 Mobile: 610-613-3063 hadley_king@gwu.edu https://orcid.org/0000-0003-1409-4549

Hi Raja and Charles,

Thank you for contacting us about using Pegasus to run your NCBI BLAST computation! I have created this ticket for the record and for future follow-up.

@Charles, if you are going to work with huge data, please make sure to select the "defq" partition, which allows up to 14 days of runtime for your jobs.

Best, Adam

penningtonea commented 6 months ago

New task for @penningtonea

create DB using BLAST on pegasus
Move 6 resulting files to location Tigran can access

penningtonea commented 6 months ago

From Joe:

Will use slimNT in place of filtered-NT for the time being for comparison of FASTA to BLAST. slimNT is smaller but is already on GW HPC.

penningtonea commented 6 months ago

Next step for NT project:

Index NT with blast, find documentation online for blast indexing and submit as slurm job.

Figure out if blast is on Pegasus.
Documentation for slurm jobs is on the GW website. Review the Github issue and email for all information for the job.
Build a test DB of 10 or so fasta files and move it to Pegasus. Run a test index with the 10-fasta DB and see how long it takes to run and fully index.
Scale up to 10, 100, 1000 and see how long each takes. Note the size of the DB (e.g. Filtered NT is a 1 TB list of fasta files).
Move NT over to HIVE, perform several comparisons and present the results to Merck
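The scale-up test in the steps above could be sketched roughly as follows. This is only an illustration under assumptions: the function names `fasta_head` and `scale_test` and the `subset_*` file names are hypothetical, and it presumes BLAST+ (`makeblastdb`) is already on PATH.

```shell
# Print the first N records of a FASTA file (a record starts at a '>' header line).
fasta_head() {
    # $1 = input FASTA, $2 = number of records to keep
    awk -v n="$2" '/^>/ { count++ } count >= 1 && count <= n' "$1"
}

# Time makeblastdb on subsets of increasing size (hypothetical output names).
scale_test() {
    # $1 = input FASTA; remaining arguments = subset sizes, e.g. 10 100 1000
    input=$1
    shift
    for n in "$@"; do
        subset="subset_${n}.fasta"
        fasta_head "$input" "$n" > "$subset"
        start=$(date +%s)
        makeblastdb -in "$subset" -dbtype nucl -parse_seqids -out "subset_${n}_db"
        end=$(date +%s)
        # Report record count, FASTA size in bytes, and wall-clock seconds.
        echo "records=$n fasta_bytes=$(wc -c < "$subset") seconds=$((end - start))"
    done
}

# Usage (requires BLAST+ on PATH): scale_test test.fasta 10 100 1000
```

Recording size and seconds together gives a rough basis for extrapolating to the full filtered NT, though indexing time will not necessarily scale linearly.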

Blockers:

[x] Access to pegasus
[x] Access to SMHS_BIOC
[x] Access to hivelab

rajamazumder commented 6 months ago

BLAST command to make database: makeblastdb -in mydb.fsa -dbtype nucl -parse_seqids
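A minimal sketch of wrapping that command in a shell function, typed with plain ASCII hyphens. The function name `make_nucl_db` and the optional second argument are assumptions for illustration; `-out` is the standard makeblastdb option for naming the output database.

```shell
# Build a nucleotide BLAST database from a FASTA file.
# Every option starts with a plain ASCII hyphen.
make_nucl_db() {
    # $1 = input FASTA
    # $2 = output database name (optional; falls back to the input file name)
    makeblastdb -in "$1" -dbtype nucl -parse_seqids -out "${2:-$1}"
}

# Usage (requires BLAST+ on PATH): make_nucl_db mydb.fsa
```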

rajamazumder commented 6 months ago

>BT006946.1 Homo sapiens cytochrome c, somatic mRNA, complete cds
ATGGGTGATGTTGAGAAAGGCAAGAAGATTTTTATTATGAAGTGTTCCCAGTGCCACACCGTTGAAAAGG
GAGGCAAGCACAAGACTGGGCCAAATCTCCATGGTCTCTTTGGGCGGAAGACAGGTCAGGCCCCTGGATA
CTCTTACACAGCCGCCAATAAGAACAAAGGCATCATCTGGGGAGAGGATACACTGATGGAGTATTTGGAG
AATCCCAAGAAGTACATCCCTGGAACAAAAATGATCTTTGTCGGCATTAAGAAGAAGGAAGAAAGGGCAG
ACTTAATAGCTTATCTCAAAAAAGCTACTAATGAGTAG
>M20622.1 Rat somatic cytochrome c mRNA
ATGGGTGATGTTGAAAAAGGCAAGAAGATTTTTGTTCAAAAGTGTGCCCAGTGCCACACTGTGGAAAAAG
GAGGCAAGCATAAGACTGGACCAAACCTCCATGGTCTGTTTGGGCGGAAGACAGGCCAGGCTGCTGGATT
CTCTTACACAGATGCCAACAAGAACAAAGGTATCACCTGGGGAGAGGATACCCTGATGGAGTATTTGGAA
AATCCCAAAAAGTACATCCCTGGAACAAAAATGATCTTCGCTGGAATTAAGAAGAAGGGAGAAAGGGCAG
ACCTAATAGCTTATCTTAAAAAGGCTACTAATGAATAA
>AK088098.1 Mus musculus 2 days neonate thymus thymic cells cDNA, RIKEN full-length enriched library, clone:E430004C08 product:cytochrome c, somatic, full insert sequence
GAGAGCGCGGGACGTCTGTCTTCGAGTCCGAACGTTCGTGGTGTTGACCAGCCCGGAACGAATTAAAAAT
GGGTGATGTTGAAAAAGGCAAGAAGATTTTTGTTCAGAAGTGTGCCCAGTGCCACACTGTGGAAAAGGGA
GGCAAGCATAAGACTGGACCAAATCTCCACGGTCTGTTCGGGCGGAAGACAGGCCAGGCTGCTGGATTCT
CTTACACAGATGCCAACAAGAACAAAGGCATCACCTGGGGAGAGGATACCCTGATGGAGTATTTGGAGAA
TCCCAAAAAGTACATCCCTGGAACAAAAATGATCTTCGCTGGAATTAAGAAGGAGGGAGAAAGGGCAGAC
CTAATAGCTTATCTTAAAAAGGCTACTAATGAGTAATTCCACTGCCTTATTTATTACAAAACAAATGTCT
CATGGCTTTTAATGTACACCATAATTTAATTCACACACCAAATTCAGATCATGAATGGCTAGCAATGTTT
TTGTTGGACAGTCCTGATTTAAGTAAAACTGACTTGTCATAAAGTGGGTACGGTCTTTATTAAAGCAACA
GTTCCAGTTGTATACATGCTACCACGGCTCTCCCTTTCTCAAGATAAGATTGGACTTAATTAGCAATGTT
TTACTTTCCATAAATAGGGGCATGTCACCTCAAACCTACTAAATGGTTTTATACTTAGATTTATATAACT
GGGCATATGAATATGCTTAAACACTGGGAAAATTCTATCACTGTCTCAGAAACAAGAAGACTCAAATGTG
TTTCAGTTGTGTTCACTGGCCTCTTTCAGGTCATGGCTAACCACCAGGAGGCAACTGTCTATTCTTGACA
GTGCATTTTTAATTAGAATGTCTACATCAAGGATGTTGCCTTTACTATTGAAAGGCATTTACTTTTTTTT
TTGTATGATATCAAATAAAGAGTATTTAACACTTTTT

penningtonea commented 6 months ago

More documentation:
https://hpc.gwu.edu/documentation/getting-started-guide/
https://github.com/biocompute-objects/bcotool.git

penningtonea commented 6 months ago

Next task:

[x] run blast test using 3 seqdb and 1 test file for blast+ VM on personal laptop (EDIT: not necessary now but probably good practice for later)
[x] review lit for slurm jobs and try to execute there

rajamazumder commented 6 months ago

It looks like the command makeblastdb works on Keeney's computer. So @penningtonea please run the command via slurm. Ask Hadley or Keeney for help if you are stuck. For blast search use blastn -db databaseName -query queryFileName. Also check if there is -output you can use
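One way the Slurm submission might look is sketched below. The script name `blast_index.sh`, the database and file names, and the assumption that BLAST+ is already on PATH inside the job are all hypothetical; the `defq` partition and its 14-day limit come from earlier in this thread. BLAST+ programs write results to a file with `-out`.

```shell
# Write a Slurm batch script (hypothetical file and database names).
cat > blast_index.sh <<'EOF'
#!/bin/bash
# Partition named earlier in this thread; the time limit is days-hours:minutes:seconds.
#SBATCH --job-name=blast_index
#SBATCH --partition=defq
#SBATCH --time=14-00:00:00
#SBATCH --nodes=1
#SBATCH --output=blast_index_%j.log

# Assumption: BLAST+ is already on PATH in the job environment.
set -euo pipefail

# Build the database, then run one search against it.
makeblastdb -in mydb.fsa -dbtype nucl -parse_seqids -out mydb
blastn -db mydb -query query.fasta -out results.txt
EOF

# Submit on the cluster with: sbatch blast_index.sh
```

Keeping the database build and the search in one job is a convenience for a small test; for the full filtered NT it may be preferable to submit them separately so that a failed search does not require re-indexing.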

Edit: spelling

kee007ney commented 6 months ago

Note for SCP transferring: You have to pull from hive-login, you cannot push from Pegasus. Make sure your public key is added in the correct place and pull the file, e.g.: scp pegasus.arc.gwu.edu:/SMHS/home/keeneyjg/sample.txt .

penningtonea commented 6 months ago

Progress update:

Connected to pegasus.
Wrote a bash script to make the database and run the blast search as written in above comments.
Executed slurm job ID 3121152.

The script did not work. Received the same error message from the call earlier - "Error: Too many positional arguments (1), the offending value: –in Error: (CArgException::eSynopsis) Too many positional arguments (1), the offending value: –in". See screenshots below for additional information.
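The offending value in the error begins with an en dash (–) rather than an ASCII hyphen (-), which usually happens when a command is pasted from a formatted document or web page; BLAST+ then treats it as a positional argument instead of an option. A small sketch for finding and fixing such characters in a script follows; the function names are hypothetical.

```shell
# List lines that contain an en dash (U+2013), with line numbers.
find_en_dashes() {
    # $1 = script file to check
    grep -n '–' "$1"
}

# Rewrite en dashes as ASCII hyphens, reading stdin and writing stdout.
fix_en_dashes() {
    sed 's/–/-/g'
}

# Usage: find_en_dashes job.sh
#        fix_en_dashes < job.sh > job_fixed.sh
```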

scontrol show job output:

output file with error message:

QUESTION: How can I change my account from Account=watkinslab and my GroupID from MG-watkinslab(1111) to our group? I no longer need access to the watkinslab account.

penningtonea commented 6 months ago

Downloading the preformatted BLAST database will remove the need to index NT on Pegasus. Log in to HIVE API and execute the following commands received from NLM to download preformatted NT:

1) Install standalone blast+.
2) Use the update_blastdb.pl command to download and extract all files needed for nt:

perl update_blastdb.pl --decompress nt

Use perl update_blastdb.pl --help to see available switches.

3) Run BLAST with 1 query sequence to test.
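Steps 2 and 3 might be scripted along these lines. This is a sketch under assumptions: the function names, the target directory, and the output file name `nt_test_hits.txt` are hypothetical, and it presumes BLAST+ is installed with `update_blastdb.pl` and `blastn` on PATH.

```shell
# Download and unpack the preformatted nt database into a target directory.
download_nt() {
    # $1 = target directory (hypothetical, e.g. /path/to/blastdb)
    mkdir -p "$1" || return 1
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$1" && update_blastdb.pl --decompress nt )
}

# Run a single-query search against the downloaded database as a smoke test.
test_nt() {
    # $1 = database directory, $2 = FASTA file holding one query sequence
    BLASTDB="$1" blastn -db nt -query "$2" -out nt_test_hits.txt
}

# Usage: download_nt /path/to/blastdb && test_nt /path/to/blastdb one_query.fasta
```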

rajamazumder commented 5 months ago

@penningtonea you still should go ahead and index filtered-NT on pegasus in parallel. It will be good training for you and also I am not 100% sure the formatted-NT will work.

penningtonea commented 5 months ago

Instructions to download indexed NT from NCBI:

Hi,

Thanks for writing to us.

Make sure you have enough disk space. After installing standalone blast+, you can use update_blastdb.pl to download and extract all files needed for nt:

perl update_blastdb.pl --decompress nt

Run perl update_blastdb.pl --help to see available switches.

Regards,

penningtonea commented 5 months ago

Documenting email response: "Do not touch those files through further manual manipulation - they are ready to use as they are. Those files are tied together by an alias file (nt.nal) and you simply call the database with -db nt if you have BLASTDB variable set to point to the directory containing all those files.

If you do not, you will need to prefix the nt with direct path."
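The two ways of naming the database described in that reply can be sketched as a small helper. The function name `nt_db_arg` is hypothetical, and the sketch assumes BLASTDB, when set, holds a single directory.

```shell
# Print the value to pass to -db for the preformatted nt database.
nt_db_arg() {
    # $1 = directory that holds the downloaded files (including nt.nal)
    if [ -n "${BLASTDB:-}" ] && [ -f "$BLASTDB/nt.nal" ]; then
        # BLASTDB points at the files: the bare name is enough.
        echo "nt"
    else
        # Otherwise prefix the name with the directory path.
        echo "$1/nt"
    fi
}

# Usage: blastn -db "$(nt_db_arg /path/to/blastdb)" -query query.fasta -out hits.txt
```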

penningtonea commented 5 months ago

Was able to execute makeblastdb on pegasus. Output:

Tested 1 sequence with manually inserted deletions against the 3-sequence db. Output:

penningtonea commented 5 months ago

> @penningtonea you still should go ahead and index filtered-NT on pegasus in parallel. It will be good training for you and also I am not 100% sure the formatted-NT will work.

Working on downloading and indexing filtered-NT. Afterwards will scp the db to HIVE3.

penningtonea commented 5 months ago

Mix-up with task. Will download NT on Pegasus today. Email sent to Adam Wong inquiring about the best way to index NT via the slurm scheduler, given the size of the job.

penningtonea commented 5 months ago

NT was downloaded and indexed on Pegasus. Directory: SMHS/home/epennington/lustre/groups/hivelab/emily/NT

Now downloading and indexing filtered-NT v7.0 on Pegasus.

HadleyKing commented 5 months ago

@penningtonea This is what I see. I am on the login node.

hadley_king@pegasus.arc.gwu.edu:/SMHS/groups/hivelab/filteredNT_v7.0/filteredNT.fasta

penningtonea commented 5 months ago

GW HPC Pegasus is not working. I am able to log in but unable to submit or allocate jobs. Indexing filtered_nt will resume when Pegasus is back up and running as usual.

penningtonea commented 4 months ago

I am moving this closed ticket to July task list to compare to similar ticket

rajamazumder commented 3 months ago

@HadleyKing and @penningtonea I was trying to follow this ticket. Can you tell me which ticket this was moved to?

penningtonea commented 1 month ago

@rajamazumder This ticket was tied to https://github.com/GW-HIVE/Platform/issues/93

HadleyKing commented 1 month ago

As of 2024-10-31 this is complete.

FilteredNT_v8.0 has been created (Oct 29 22:52), indexed (Oct 30 21:12), and tested (Oct 30 23:06).

Filtered NT: pegasus.arc.gwu.edu:/scratch/hivelab//blastdb/filteredNT_8.0/filteredNT_v8.0.fasta
Indexed Filtered NT: pegasus.arc.gwu.edu:/scratch/hivelab//blastdb/filteredNT_8.0/indexed_filteredNT_v8.0

GW-HIVE / filtered_nt

Index FilteredNT on Pegasus for BLAST #18