Support of cohort feature in phenopacket

seoanezonjic commented 2 years ago

I'm interested in execute Lirical with a high number of HPO profiles. I think that the phenopacket's cohort feature is the correct option to manage this situation but Lirical not supports this. It only can executes a hpo profile at same time. Would be it possible? Thank you in advance

ielis commented 1 year ago

Hi @seoanezonjic the cohorts are not yet supported. The only way for analysis of larger number of samples is to run LIRICAL in a shell loop with phenopacket input. Unfortunately, this incurs some overhead due to bootstrapping LIRICAL app in the each iteration.

In principle, supporting cohort would not be too difficult with the current state of the code base. However, since it's been some time since you opened this issue (sorry for the delayed response), I am wondering if you are still in need of this functionality.

Thanks & cheers, Daniel

seoanezonjic commented 1 year ago

Hi @ielis Due to the great change in the software I needed time to install and use the last version to know how work the introduce changes. I really need the requested feature to execute benchmarking workflows and to analyze HPO profiles that I generate in custom analysis. I would like to give as input N patients to lirical and obtain as output the TSV file with results tagged by each patient in input (concatenating all the patient tables and adding a column with the patient id). Thank you in advance

ielis commented 1 year ago

Hi, yeah, there are many changes in the APIs, we're sorry for the breaking changes. We aim for a more stable code base starting from v2.0.0.

Regarding your application/benchmarking; in principle the analysis you describe should be doable with LIRICAL and a shell/python script for postprocessing results. The new JSON output format includes all the data you may need to generate the tables. The JSON format reports inputs (field analysisData), metadata (field analysisMetadata), and the results ordered by descending post-test probability (analysisResults). I would recommend to postprocess the JSON files with a Python script to get the tables you need as this will be much faster than waiting for cohort & family elements of the Phenopacket Schema to be supported by LIRICAL.

seoanezonjic commented 1 year ago

Hi @ielis Obviously I can do the approach that you suggest and I've done this. The problem is the overhead of execute N times (> 1000) lirical and the amount of files to process. Maybe, instead of use the cohort feature you could add an input flag pointing to a folder with N phenopackets and execute an iterative analysis within lirical. Your paper show iterative analysis in the benchmarking section, this feature would ease this kind of work and would be very useful. Thank you

ielis commented 1 year ago

Are you interested in running phenotype-only analysis, or each phenopacket does in fact have a VCF with variants?

seoanezonjic commented 1 year ago

In my case, my interest is phenotype-only analysis without genomic data.

ielis commented 1 year ago

I added a new command for performing phenotype-only analysis on a collection of phenopackets. Can you please give it a try and let me know if it works for you?

You'll have to build from source on a specific branch:

cd LIRICAL
git checkout experimental_cli
git pull
./mvnw -Prelease clean package

This will build LIRICAL and you'll have a JAR file at lirical-cli/target/lirical-cli-2.0.0-RC3-SNAPSHOT.jar:

Here we define alias for running lirical:

alias lirical="java -jar $(pwd)/lirical-cli/target/lirical-cli-2.0.0-RC3-SNAPSHOT.jar"
lirical experimental phenopackets --help

The command above will show help for the experimental phenopackets sub-command:

Usage: lirical experimental phenopackets [-ghVv] [--ddndv]
       [--display-all-variants] [--sdwndv] [--strict] [--use-orphanet]
       [--assembly={hg19,hg38}] [-b=<backgroundFrequencyFile>]
       -d=<liricalDataDirectory>
       [--default-allele-frequency=<defaultAlleleFrequency>]
       [-e=<exomiserDatabase>] [-e19=<exomiserHg19Database>]
       [-e38=<exomiserHg38Database>] [-m=<minDifferentialsToShow>]
       [-o=<outdir>] [--pathogenicity-threshold=<pathogenicityThreshold>]
       [-t=<lrThreshold>] [--transcript-db={REFSEQ,UCSC}]
       [--variant-background-frequency=<defaultVariantBackgroundFrequency>]
       [-x=<outfilePrefix>] [-f[=<outputFormats>...]]... [phenopacket file
       (s)...]
Run LIRICAL for several phenopackets at once in phenotype-only mode.
      [phenopacket file(s)...]
                       Input phenopacket(s).
  -v                   Specify multiple -v options to increase verbosity.
                       For example, `-v -v -v` or `-vvv`
      --assembly={hg19,hg38}
                       Genome build (default: hg38).
  -h, --help           Show this help message and exit.
  -V, --version        Print version information and exit.

(...)  # The remaining options are omitted here

You can run the command above on a collection of phenopacket files, JSON, YAML or protobuf should work OK. The outputs are written into a folder controlled by the -o | --output-directory option. The CLI works as in the official commands, e.g. you must provide the -d | --data option, you can use --strict, --transcript-db options, etc. There are few differences though:

-x | --prefix is ignored. The results for a phenopacket are stored into a file based on phenopacket | subject | id in your phenopacket
-e[19|38] options for providing path to Exomiser database are ignored since the command only runs phenotype-only mode.

That's it :)

I hope the command will be useful to you. Please let me know if you encounter any issues and also if it works OK. Then I'll merge the PR #623 and we'll keep the command in the codebase until we add support for Phenopacket Schema cohorts.

seoanezonjic commented 1 year ago

Hi @ielis The new feature works as expected but I have a new bottle neck. My analysis generates a great amount of hpo profiles (> 5000) and they are given to Lirical with the new feature avoiding the need of thousands of executions. But it have to write thousands of output files. Could you make the tsv file have an additional column with the phenopacket identifier and join all the results in one single file? Thank you in advance

TheJacksonLaboratory / LIRICAL

Support of cohort feature in phenopacket #560