Closed mahesh-panchal closed 10 months ago
Possible solutions:
blast_db_fasta:
Interproscan:
Take a few sequences from the second row in that file, blastp.merged.gz.
E.g. in the header sp|Q8TGK7|YAG8_YEAST the unique ID is Q8TGK7.
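A minimal sketch of pulling that ID out of a header, assuming the `db|ACCESSION|ENTRY_NAME` layout shown in the example above. The function name `uniprot_accession` is hypothetical and not part of the pipeline or GAAS.

```python
def uniprot_accession(header: str) -> str:
    """Return the accession from a UniProt-style header.

    Assumes the 'db|ACCESSION|ENTRY_NAME' layout, e.g.
    'sp|Q8TGK7|YAG8_YEAST'. A leading '>' and a trailing
    description after whitespace are tolerated.
    """
    tokens = header.lstrip(">").split()
    if not tokens:
        raise ValueError("empty header")
    fields = tokens[0].split("|")
    if len(fields) < 3:
        raise ValueError(f"not a db|accession|name header: {header!r}")
    # The accession is the second pipe-separated field.
    return fields[1]


print(uniprot_accession("sp|Q8TGK7|YAG8_YEAST"))
```

Applied to the relevant column of blastp.merged.gz, this would give the list of accessions to keep for the test database.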
Do you mean the 2nd column? What is that exactly? A blast of one of the sequences or all?
* blast_db_fasta: creating a test protein file should not be complex; we can have a small one for a specific species. Which species is the test on?
My plan at the moment is to retrieve the contents of this link.
https://www.uniprot.org/uniprot/?query=organism:4932&format=fasta
* interproscan: I am not sure. If we skip it then it is impossible to test the pipeline, as it will be only a blast. And for the 2 other points, I don't know how feasible it is; maybe it is easier since the docker container is already ready.
That's a good point. Then let's try and use the docker image. It should be possible to alias the interproscan command to the docker image run command somehow which will make it suitable for the conda environment.
Sorry yes I meant 2nd column.
The species seems to be yeast. The GFF annotation file contains just a few gene models, so we must be sure that a few of them match this list: https://www.uniprot.org/uniprot/?query=organism:4932&format=fasta
And if we want to minimise it as much as possible, just take a few of the best hits from the result I sent you.
this list is enough, we can even take only the first 20.
O13588
Q8TGK7
Q3E770
Q8TGK6
P39709
P39710
O13511
O13512
Q6B2U8
P39711
Q3E791
P39712
P39708
P39713
P39714
A0A023PYD0
P39715
P27825
O13513
P39717
P39718
Q01574
P39719
P39720
P39721
P39722
A0A023PZE2
P39723
P39724
P39725
Q3E793
P39726
Q01329
O13514
P39727
P11433
P13365
P06182
P00549
Q8TGR8
Q3E741
P39728
P39729
P39730
O13515
P39731
P28003
P28005
P28004
A0A023PZ94
Can you provide a programmatic access link for that?
I forgot: otherwise you can use gaas_ncbi_get_sequence_from_list.pl --list id.txt from GAAS. You put the accessions as a column in id.txt. It is slow (~2 sec per sequence), but for 20 sequences that is fine (I sleep 1 sec between two requests to be sure not to be blacklisted).
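As a rough sketch, the id.txt input described above (one accession per line, first 20 only) could be prepared like this. The helper name `write_id_list` is hypothetical; the accessions are copied from the list earlier in the thread.

```python
from pathlib import Path
from typing import List

# Accessions copied from the list posted in this thread.
ACCESSIONS = """
O13588 Q8TGK7 Q3E770 Q8TGK6 P39709 P39710 O13511 O13512 Q6B2U8 P39711
Q3E791 P39712 P39708 P39713 P39714 A0A023PYD0 P39715 P27825 O13513 P39717
P39718 Q01574 P39719 P39720 P39721 P39722 A0A023PZE2 P39723 P39724 P39725
""".split()


def write_id_list(accessions: List[str], path: Path, limit: int = 20) -> List[str]:
    """Write the first `limit` accessions to `path`, one per line."""
    selected = accessions[:limit]
    path.write_text("\n".join(selected) + "\n")
    return selected


if __name__ == "__main__":
    # Produces the single-column file passed to --list.
    write_id_list(ACCESSIONS, Path("id.txt"))
```

The resulting file is what `gaas_ncbi_get_sequence_from_list.pl --list id.txt` would read.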
It needs to be a URL. In the same way the test files are.
see: https://github.com/NBISweden/pipelines-nextflow/blob/master/FunctionalAnnotationPreparation/config/test_profile.config
The aim would be to replace:
blast_db_fasta = '/projects/references/databases/uniprot/2018-03/uniprot_sprot.fasta'
with:
blast_db_fasta = 'https://www.uniprot.org/uniprot/?query=organism:4932&format=fasta'
Gah, glob characters are not allowed in https requests (even though I'm not trying to glob), so using the REST API isn't an option.
Current options are to use a different online file, or to make a test dir of our own to pull the file from.
otherwise use this link...
blast_db_fasta = ftp://ftp.ensembl.org/pub/release-99/fasta/saccharomyces_cerevisiae/pep/Saccharomyces_cerevisiae.R64-1-1.pep.all.fa.gz
Add a test: if it is an archive, unzip it; if there is no DB, run makeblastdb ...
I realised I can use the REST API. The path I provided in fromPath appended a glob pattern, which I've now corrected, but now I'm trying to solve the conditional input execution.
OK. A conditional process has been added to make the blast database if needed.
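The check discussed above (unzip if the file is an archive, build the database only if none exists) can be sketched in Python as follows. This is not the Nextflow process that was added to the pipeline, only an illustration of the logic. It assumes a single-volume protein database whose index files sit next to the FASTA with `.phr`, `.pin` and `.psq` suffixes, and all function names are hypothetical.

```python
import gzip
import shutil
from pathlib import Path
from typing import List, Optional

GZIP_MAGIC = b"\x1f\x8b"
# Assumption: index files of a single-volume protein BLAST database.
PROTEIN_DB_SUFFIXES = (".phr", ".pin", ".psq")


def is_gzipped(path: Path) -> bool:
    """Detect gzip by its two magic bytes rather than by file extension."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def prepare_fasta(path: Path) -> Path:
    """Decompress a gzipped FASTA next to itself; return the plain FASTA path."""
    if not is_gzipped(path):
        return path
    if path.suffix == ".gz":
        target = path.with_suffix("")
    else:
        target = path.with_name(path.name + ".unzipped")
    with gzip.open(path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


def has_protein_db(fasta: Path) -> bool:
    """True if all expected index files exist alongside the FASTA."""
    return all(Path(str(fasta) + s).exists() for s in PROTEIN_DB_SUFFIXES)


def ensure_blast_db(path: Path) -> Optional[List[str]]:
    """Return the makeblastdb command to run, or None if a database exists."""
    fasta = prepare_fasta(path)
    if has_protein_db(fasta):
        return None
    return ["makeblastdb", "-in", str(fasta), "-dbtype", "prot"]
```

The returned command list can be handed to `subprocess.run(cmd, check=True)`. Returning it instead of running it keeps the decision logic testable without BLAST installed.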
Now all that remains is to fix the containers.
TranscriptAssembly container fixed. See #16
The interproscan bioconda recipe is very close to being done. I have fixed all missing dependencies and am now trying to compile it, but it fails, I guess due to the Java version (see https://github.com/bioconda/bioconda-recipes/pull/22802). I made a request to conda-forge to release a newer version (see https://github.com/conda-forge/openjdk-feedstock/issues/72).
Test workflows and profiles exist for all the workflows.
In order to make GitHub Actions work with the test profiles, certain issues need to be resolved.
Issues:
params.blast_db_fasta: Needs a URL to download the protein FASTA file from.