Parameters and paths for test profile

mahesh-panchal commented 4 years ago

In order to make GitHub Actions work with the test profiles, certain issues need to be resolved.

Issues:

FunctionalAnnotationPreparation.nf:
- params.blast_db_fasta: Needs a URL to download the protein fasta file from.
- Interproscan is locally installed and not available through conda. We need a way to make interproscan available in the GitHub Actions test environment.

mahesh-panchal commented 4 years ago

Possible solutions:

blast_db_fasta:
- Fetch the protein fasta from ENA each time (might get IP blocked?). I'm just not sure what protein file to use. @Juke34 @LucileSol suggestions?
- Store a test protein file in a test_data branch?

Interproscan:
- Implement a skip-interproscan flag?
- Install interproscan using GitHub Actions (probably not feasible time-wise)?
- Use the docker container (https://hub.docker.com/r/biocontainers/interproscan) only for interproscan, if that is possible?

LucileSol commented 4 years ago

blast_db_fasta: creating a test protein file should not be complex; we can have a small one for a specific species. Which species is the test on?
interproscan: I am not sure. If we skip it, then it is impossible to test the pipeline, as it will only be a blast. As for the two other points, I don't know how feasible they are; since the docker container is already available, that may be the easier option.

Juke34 commented 4 years ago

Take a few sequences from the second row in that file, blastp.merged.gz

e.g. in the header sp|Q8TGK7|YAG8_YEAST the unique ID is Q8TGK7.
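
As a minimal sketch of how the unique ID could be pulled out of such a header, assuming the usual `db|accession|entry_name` layout seen in the example above (the function name `accession_from_header` is hypothetical, not part of GAAS or the pipeline):

```python
def accession_from_header(header: str) -> str:
    """Return the accession (second '|'-separated field) of a UniProt-style header."""
    # Drop a leading '>' if present and ignore any description after the first space.
    tokens = header.lstrip(">").split()
    fields = tokens[0].split("|") if tokens else []
    if len(fields) < 3:
        raise ValueError(f"not a db|accession|name header: {header!r}")
    return fields[1]


print(accession_from_header("sp|Q8TGK7|YAG8_YEAST"))
```

Headers that do not follow this three-field layout raise a `ValueError` rather than returning a guess.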

mahesh-panchal commented 4 years ago

Do you mean the 2nd column? What is that exactly? A blast of one of the sequences or all?

> * blast_db_fasta
>   create a test protein file should not be complex, we can have a small one of a specific species. on which species is the test?

My plan at the moment is to retrieve the contents of this link.

https://www.uniprot.org/uniprot/?query=organism:4932&format=fasta

> * interproscan
>   I am not sure, if we skip it then it is impossible to test the pipeline as it will be only a blast.
>   And for the 2 other points, I don't know how feasible it is, maybe as the docker container is already ready it can be easier.

That's a good point. Then let's try to use the docker image. It should be possible to alias the interproscan command to the docker run command for the image, which would make it usable from within the conda environment.
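
A rough sketch of what such a wrapper could look like, written here as a small Python helper that only builds the `docker run` command line. The image name comes from the Docker Hub link above; the missing tag, the `interproscan.sh` entry point inside the image, and the function name `interproscan_docker_cmd` are assumptions and would need checking against the actual container.

```python
import os
import shlex

# Image name taken from the Docker Hub link; no tag is pinned here (assumption).
IMAGE = "biocontainers/interproscan"


def interproscan_docker_cmd(args, workdir=None, image=IMAGE):
    """Build a 'docker run' command that forwards args to interproscan in the container."""
    workdir = os.path.abspath(workdir or os.getcwd())
    return [
        "docker", "run", "--rm",
        "-v", f"{workdir}:{workdir}",  # mount the working directory at the same path
        "-w", workdir,                 # run from that directory inside the container
        image,
        "interproscan.sh", *args,      # assumed entry point inside the image
    ]


cmd = interproscan_docker_cmd(["-i", "proteins.fasta", "-f", "tsv"], workdir="/tmp/work")
print(shlex.join(cmd))
```

Mounting the working directory at the same path inside the container keeps the input and output paths valid on both sides, which is what would let the wrapper stand in for a locally installed interproscan.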

Juke34 commented 4 years ago

Sorry, yes, I meant the 2nd column. The species seems to be yeast. The gff annotation file contains just a few gene models, so we must be sure that a few of them match this list: https://www.uniprot.org/uniprot/?query=organism:4932&format=fasta. And if we want to keep the file as small as possible, just take a few of the best hits from the result I sent you.

Juke34 commented 4 years ago

This list is enough; we can even take only the first 20.

O13588
Q8TGK7
Q3E770
Q8TGK6
P39709
P39710
O13511
O13512
Q6B2U8
P39711
Q3E791
P39712
P39708
P39713
P39714
A0A023PYD0
P39715
P27825
O13513
P39717
P39718
Q01574
P39719
P39720
P39721
P39722
A0A023PZE2
P39723
P39724
P39725
Q3E793
P39726
Q01329
O13514
P39727
P11433
P13365
P06182
P00549
Q8TGR8
Q3E741
P39728
P39729
P39730
O13515
P39731
P28003
P28005
P28004
A0A023PZ94

mahesh-panchal commented 4 years ago

Can you provide a programmatic access link for that?

Juke34 commented 4 years ago

https://www.uniprot.org/help/api%5Fbatch%5Fretrieval

Juke34 commented 4 years ago

I forgot: otherwise you can use gaas_ncbi_get_sequence_from_list.pl --list id.txt from GAAS. You put the accessions as a column in id.txt. It is slow, ~2 sec per sequence, but for 20 sequences it is fine (I sleep 1 sec between two requests to be sure not to be blacklisted).
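
The same idea (read accessions from a one-column file, fetch them one by one, pause between requests) can be sketched in Python. This is not how the GAAS script is implemented; the per-entry URL pattern in `ENTRY_URL` and all function names are assumptions made for illustration.

```python
import time
import urllib.request

# Assumption: a per-entry FASTA URL of this shape; it is not taken from the thread.
ENTRY_URL = "https://www.uniprot.org/uniprot/{acc}.fasta"


def read_accessions(text, limit=20):
    """Take the first column of each non-blank line and keep the first `limit` IDs."""
    accessions = [line.split()[0] for line in text.splitlines() if line.strip()]
    return accessions[:limit]


def http_get(url):
    """Download a URL and return its body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def fetch_fasta(accessions, fetch=http_get, delay=1.0, sleep=time.sleep):
    """Fetch one record per accession, sleeping `delay` seconds between requests."""
    records = []
    for index, accession in enumerate(accessions):
        if index:  # no pause before the first request
            sleep(delay)
        records.append(fetch(ENTRY_URL.format(acc=accession)))
    return "".join(records)


# Example (performs network requests, so it is left commented out):
# with open("id.txt") as handle:
#     print(fetch_fasta(read_accessions(handle.read())))
```

Passing `fetch` and `sleep` in as parameters keeps the loop testable without touching the network.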

mahesh-panchal commented 4 years ago

It needs to be a URL, in the same way the test files are. See: https://github.com/NBISweden/pipelines-nextflow/blob/master/FunctionalAnnotationPreparation/config/test_profile.config

The aim would be to replace:
blast_db_fasta = '/projects/references/databases/uniprot/2018-03/uniprot_sprot.fasta'
with:
blast_db_fasta = 'https://www.uniprot.org/uniprot/?query=organism:4932&format=fasta'

mahesh-panchal commented 4 years ago

Gah, glob characters are not allowed in https requests (even though I'm not trying to glob), so using the REST API isn't an option.

The current options are to use a different online file, or to make a test dir of our own to pull the file from.

Juke34 commented 4 years ago

Otherwise use this link: blast_db_fasta = ftp://ftp.ensembl.org/pub/release-99/fasta/saccharomyces_cerevisiae/pep/Saccharomyces_cerevisiae.R64-1-1.pep.all.fa.gz

Add a test: if it is an archive, unzip it; if there is no DB, run makeblastdb ...
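
The two checks could look roughly like the sketch below. It is written as plain Python rather than as the Nextflow process, and it assumes a single-volume protein database whose index files sit next to the fasta with the suffixes .phr, .pin and .psq; all function names are hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def is_gzip(path):
    """Detect gzip by its two magic bytes rather than by file extension."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"


def decompress_if_needed(path):
    """Return a path to an uncompressed fasta, unzipping the input if it is gzipped."""
    path = Path(path)
    if not is_gzip(path):
        return path
    if path.suffix == ".gz":
        out = path.with_suffix("")
    else:
        out = path.with_name(path.name + ".unzipped")
    with gzip.open(path, "rb") as src, open(out, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out


def has_protein_db(fasta):
    """True if the assumed BLAST protein index files exist next to the fasta."""
    return all(Path(f"{fasta}{ext}").exists() for ext in (".phr", ".pin", ".psq"))


def prepare_blast_db(path):
    """Return (fasta_path, command); command is None when a database already exists."""
    fasta = decompress_if_needed(path)
    if has_protein_db(fasta):
        return fasta, None
    return fasta, ["makeblastdb", "-in", str(fasta), "-dbtype", "prot"]
```

The function only returns the command; a caller would run it with something like `subprocess.run(command, check=True)` when it is not `None`.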

mahesh-panchal commented 4 years ago

I realised I can use the REST API after all. The path I provided in fromPath had a glob pattern appended, which I've now corrected. Now I'm trying to solve the conditional input execution.

mahesh-panchal commented 4 years ago

OK. A conditional process has been added to make the blast database if needed.

Now all that remains is to fix the containers.

The interproscan container fails on using the ProSitePatterns data as it's not in the container, but not disabled either like some of the other databases.
The transcript assembly pipeline needs a proper path to a correctly combined conda container.

mahesh-panchal commented 4 years ago

TranscriptAssembly container fixed. See #16

Juke34 commented 4 years ago

The interproscan bioconda recipe is very close to being done. I have fixed all the missing dependencies; now I am trying to compile it, but it fails, I guess due to the Java version (see https://github.com/bioconda/bioconda-recipes/pull/22802). I made a request to conda-forge to release a newer version (see https://github.com/conda-forge/openjdk-feedstock/issues/72).

mahesh-panchal commented 10 months ago

Test workflows and profiles exist for all the workflows.

NBISweden / pipelines-nextflow

Parameters and paths for test profile #13