Closed moahaegglund closed 2 years ago
@vwirta here is our suggestion about what to include in the delivery report for BALSAMIC, please have a look at it.
Thanks for drafting this! @moahaegglund and @keyvanelhami
Comments from me:
@vwirta
Quick replies
2) It is not a question of family relation, but rather whether the gender is correct (as in what is expected).
3) Basically, is it an FFPE sample or not? Many tissues can be either FF or FFPE.
4) I think the version is important as this defines what has been used at our end. I do not think they have any other way of knowing.
5) This should be something very concise, but still specific. We can discuss with some of the customers to make sure they would understand our concise way of summarising this.
@keyvanelhami
- Yes. @ashwini06 we can extract and add the version too, right?
I am not clear what bait set version needs to be reported here. For example, for the panel gmsmyeloid5.2, will the version be 5.2 or v2?
Tagging @annaengstrom for the answer.
- Yes I guess it's easy to extract the filters when we know which balsamic version that was used for the analysis. Is that correct @ashwini06 ?
Currently, variant calling filters are embedded in the balsamic code. To report them in the delivery report, the filters used need to be summarized in a config file. These filters are generalized for each variant caller. This would be similar to how we pass qc_metrics to the delivery report. Alternatively, we can add this variant-calling filter information to the balsamic readthedocs and link to those descriptions from the delivery report, under a section with references or similar.
@keyvanelhami
7. Yes. @ashwini06 we can extract and add the version too, right?
I am not clear what bait set version needs to be reported here. For example, for the panel gmsmyeloid5.2, will the version be 5.2 or v2? Tagging @annaengstrom for the answer.
In here we need some kind of identifier of the version that the user can understand. It could also be a version and link to our website with more info, or it could even perhaps be the bait set design IDs.
Version and Bed for the Bed Version tab in statusDB? And complement this with a link to more information.
@ashwini06 The first digit (5) represents the panel and the second (2) the version of that panel. To be able to re-run the analysis, both numbers are needed.
@vwirta The TE-number will remain fixed for a combined panel synthesis (considered one panel). The TE-number will vary for a blended panel synthesis, but a blended panel consists of two or more panels which each have a fixed TE-number. I am gathering this information on all panels, and it is more or less a requirement for us to keep it somewhere (either visible only to us, or also visible to our customers).
Following the challenges with the pdf version of the report in https://github.com/Clinical-Genomics/scout/issues/3108 we should think about changing the format to landscape. Will we keep the "Coverage and QC report" that is generated for panel cases or will the delivery report replace this one?
I don't like landscape format... Can we consider getting rid of the table format if that makes the table too wide, and instead simply list items in rows? Basically two columns (parameter, value).
Regarding coverage and qc report... I think we should have one that is more detailed than can be added to the delivery report. The more detailed one could also include drop out regions. We need to think through the content of it as well
One more thing... I think we should add some of the PCT coverage numbers to the delivery report.
I have now updated the draft of the delivery report from the discussion we had last week. Please comment if something is missing or should be changed.
In table 3: Familj -> Case
In table 4: The PCT coverage values chosen need to be linked to the expected performance of the assay. If we claim 5% VAF sensitivity, then we need 200x coverage, assuming 10 alt reads to support a variant (10 reads / 0.05 = 200 reads). If we have space in the table, I'd add 200x, 500x, 750x to capture panel performance. For WGS we should include lower thresholds. Not sure, though, how difficult it is to have different levels for different assays. If complicated, we could simplify here and refer to the coverage report for further details.
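The coverage arithmetic above can be written down as a small helper. This is only a sketch of the expected-read-count reasoning (depth = supporting reads / VAF); the function name is made up for illustration, and it ignores sampling variation, so the real sensitivity at that depth would be lower than the nominal figure suggests.

```python
import math


def min_depth_for_vaf(vaf: float, min_alt_reads: int) -> int:
    """Smallest depth at which the *expected* number of alt reads reaches min_alt_reads.

    Hypothetical helper: plain expected-value arithmetic, no statistical model.
    """
    if not 0 < vaf <= 1:
        raise ValueError("vaf must be in the interval (0, 1]")
    if min_alt_reads < 1:
        raise ValueError("min_alt_reads must be at least 1")
    # Round before ceil so floating-point noise cannot push the result up by one.
    return math.ceil(round(min_alt_reads / vaf, 9))


# 5% VAF with 10 supporting reads, as in the comment above.
depth_panel = min_depth_for_vaf(0.05, 10)
print(depth_panel)
```

With the same 5% VAF but a criterion of 3 supporting reads, the helper gives 60, which is why a much lower threshold can make sense for WGS.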
Is Bioinformatic analys mip-dna?
@vwirta I guess we can link the values for PCT_OFF_bait to the QC-criteria we chose for each panel in Balsamic. @ivadym maybe can answer to this. If we decide to include several different levels of PCT_OFF_bait, we can extract this from the HS_metrics?
@AnnaLeinfelt The value in that field is taken from status-db and will be "balsamic" for Balsamic cases.
Remember to only generate a report with a logo for the accredited analyses, for BALSAMIC analyses that is only targeted genome with the myeloid panel at the moment.
In that spectrum we can extract from Picard's HsMetrics the [..., 50X, 100X, 250X, 500X, 1000X, 2500X, ...] coverage-associated values. And yes, I think it would be possible to link what to show to the QC criteria @keyvanelhami.
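As a rough illustration of pulling those values out, here is a minimal sketch that reads the PCT_TARGET_BASES_<N>X columns from the text of a Picard metrics file. It assumes the usual layout (a "## METRICS CLASS" line followed by a tab-separated header row and one value row); the function name and the example content are made up, and this is not how BALSAMIC or cg actually does it. Only thresholds that are present as columns in the file are returned.

```python
import re
from typing import Dict


def parse_pct_target_bases(metrics_text: str) -> Dict[int, float]:
    """Return {coverage threshold: fraction of target bases} from HsMetrics text."""
    lines = metrics_text.splitlines()
    for index, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            header = lines[index + 1].split("\t")
            values = lines[index + 2].split("\t")
            break
    else:
        raise ValueError("no '## METRICS CLASS' section found")

    coverage = {}
    for name, value in zip(header, values):
        match = re.fullmatch(r"PCT_TARGET_BASES_(\d+)X", name)
        if match and value not in ("", "?"):
            coverage[int(match.group(1))] = float(value)
    return coverage


# Made-up, heavily shortened example; a real file has many more columns.
EXAMPLE_METRICS = "\n".join(
    [
        "## htsjdk.samtools.metrics.StringHeader",
        "# CollectHsMetrics (made-up example)",
        "",
        "## METRICS CLASS\tpicard.analysis.directed.HsMetrics",
        "BAIT_SET\tMEAN_TARGET_COVERAGE\tPCT_TARGET_BASES_100X\tPCT_TARGET_BASES_250X\tPCT_TARGET_BASES_500X",
        "example_panel\t812.4\t0.998\t0.97\t0.85",
    ]
)

coverage = parse_pct_target_bases(EXAMPLE_METRICS)
print(coverage)
```

Which of the returned thresholds end up in the report could then be decided per panel from the QC criteria.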
Update from today's meeting with klinisk genetik:
@ivadym and @patrikgrenfeldt this is the final lay-out that you can use after our discussion from last week. @ivadym Also note that we need to extract 200X coverage from picard (see above)
Do we also have a document with a source for the information, field by field? Good educational opportunity for me to see how we pull this together.
@keyvanelhami
This describes the process for the current MIP report, but many fields will be taken from the same places for the future Balsamic report. From the cg docstring (most likely the best place to look, although some things listed under trailblazer actually come from MIP):
The report contains data from several sources:
status-db:
  family
  family.data_analysis (missing on most re-runs)
  customer_name
  applications
  accredited
  panels
  samples
  sample.internal_id
  sample.status
  sample.ticket
  sample.million_read_pairs (for sequenced samples, from demux + ready made libraries (rml), not for external)
  sample.prepared_at (not for rml and external)
  sample.received_at
  sample.sequenced_at (for rml and in-house sequenced samples)
  sample.delivered_at
lims:
  sample.name
  sample.sex
  sample.source (missing on most re-runs)
  sample.application
  sample.prep_method (not for rml or external)
  sample.sequencing_method (for sequenced samples)
trailblazer:
  sample.mapped_reads
  sample.duplicates
  sample.analysis_sex
  pipeline_version
  genome_build
chanjo:
  sample.target_coverage
  sample.target_completeness
scout:
  panel-genes
calculated:
  today
  sample.processing_time
  report_version
These are the provisional parameters that are used to generate the delivery report. It would be nice if someone could review the optional or required fields. If a required field is missing, the report won't be generated, while optional fields will appear as N/A if their value couldn't be extracted.
Most of the data has been retrieved following the previous design of the MIP_DNA delivery report (source field).
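To make the required/optional behaviour concrete, here is a minimal standard-library sketch of the rule described above: a missing required field stops report generation, while a missing optional field is rendered as N/A. The function and exception names are hypothetical and this is not the actual cg code.

```python
from typing import Any, Dict, Iterable

NA = "N/A"


class MissingRequiredFieldError(Exception):
    """Raised when a required report field has no value."""


def prepare_report_fields(data: Dict[str, Any], required: Iterable[str]) -> Dict[str, Any]:
    """Fail on missing required fields, fill missing optional ones with N/A."""

    def is_missing(value: Any) -> bool:
        return value is None or value == ""

    missing = sorted(name for name in required if is_missing(data.get(name)))
    if missing:
        raise MissingRequiredFieldError("missing required fields: " + ", ".join(missing))
    return {name: NA if is_missing(value) else value for name, value in data.items()}


# Made-up example: "name" is required, "pipeline_version" is optional.
fields = prepare_report_fields(
    {"name": "example_case", "pipeline_version": None},
    required=["name"],
)
print(fields)
```

Listing a field under `required` is then the only thing that decides whether its absence blocks the report, which makes the review of the lists below the main design decision.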
Report attributes:
optional: version: delivery report version; source: StatusDB/analysis/family/analyses(/index)
required: date: report generation date; source: CG runtime
required: accredited: whether the report is accredited or not; source: all(StatusDB/application/accredited)
Customer attributes:
required: name: customer name; source: statusDB/family/customer/name
optional: id: customer internal ID; source: statusDB/family/customer/internal_id
optional: invoice_address: customer's invoice address; source: statusDB/family/customer/invoice_address
required: scout_access: whether the customer has access to scout or not; source: statusDB/family/customer/scout_access
Case attributes:
required: name: case name; source: StatusDB/family/name
Data analysis attributes:
required: customer_pipeline: data analysis requested by the customer; source: StatusDB/family/data_analysis
required: pipeline: actual pipeline used for analysis; source: statusDB/analysis/pipeline
optional: pipeline_version: pipeline version; source: statusDB/analysis/pipeline_version
optional: type: analysis type carried out; BALSAMIC specific; source: pipeline workflow
optional: genome_build: build version of the genome reference; source: pipeline workflow
optional: panels: list of case specific panels; MIP specific; source: StatusDB/family/panels
Sample attributes:
required: name: sample name; source: LIMS/sample/name
required: id: sample internal ID; source: StatusDB/sample/internal_id
required: ticket: ticket number; source: StatusDB/sample/ticket_number
optional: status: sample status provided by the customer; MIP specific; source: StatusDB/family-sample/status
optional: gender: sample gender provided by the customer; source: LIMS/sample/sex
optional: source: sample type/source; source: LIMS/sample/source
optional: tumour: whether the sample is a tumour or normal one; BALSAMIC specific; source: StatusDB/sample/is_tumour
Sample metadata attributes:
optional: capture_kit: panel bed used for the analysis
optional: capture_kit_version: panel bed version; BALSAMIC specific
optional: gender: gender estimated by the pipeline
optional: million_read_pairs: number of million read pairs obtained; source
optional: mapped_reads: percentage of reads aligned to the reference sequence; MIP specific
optional: duplicates: fraction of mapped sequence that is marked as duplicate
optional: target_coverage: mean coverage of a target region; MIP specific
optional: target_bases_10X: percent of targeted bases that are covered to 10X coverage or more; MIP specific
optional: target_bases_250X: percent of targeted bases that are covered to 250X coverage or more; BALSAMIC specific
optional: target_bases_500X: percent of targeted bases that are covered to 500X coverage or more; BALSAMIC specific
optional: median_coverage: median coverage in bases
optional: mean_insert_size: mean insert size of the distribution; BALSAMIC specific
optional: fold_80: fold 80 base penalty; BALSAMIC specific
Sample timestamp attributes:
optional: ordered_at: order date; source: StatusDB/sample/ordered_at
optional: received_at: arrival date; source: StatusDB/sample/received_at
optional: prepared_at: library preparation date; source: StatusDB/sample/prepared_at
optional: sequenced_at: sequencing date; source: StatusDB/sample/sequenced_at
optional: delivered_at: delivery date; source: StatusDB/sample/delivered_at
optional: processing_days: days between sample arrival and delivery; source: CG workflow
Sample method attributes:
optional: library_prep: library preparation method; source: LIMS/sample/prep_method
optional: sequencing: sequencing procedure; source: LIMS/sample/sequencing_method
Application attributes:
required: tag: application identifier; source: StatusDB/application/tag
optional: version: application version; source: LIMS/sample/application_version
optional: prep_category: library preparation category; source: StatusDB/application/prep_category
optional: description: analysis description; source: StatusDB/application/description
optional: limitations: application limitations; source: StatusDB/application/limitations
required: accredited: whether the sample-associated process is accredited or not; source: StatusDB/application/is_accredited
Report attributes and Application attributes: accredited - How is accreditation status set when we only have the myeloid panel validated and it shares app tag with other types of panels?
Sample metadata attributes: what are capture_kit and capture_kit_version used for? For analysis, the UDF bait set is used to specify which bed file is to be/has been used.
Currently the accreditation field is extracted in the same way as it is done with MIP_DNA: it checks the application.is_accredited attribute of the case samples, and if one of these values is set to False, it labels that report as not accredited. However, as the bait_set value is also extracted, the accreditation field for generating the BALSAMIC report can be fixed based on this parameter (GMSmyeloid) instead of evaluating each of the application.is_accredited fields.
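The two approaches can be compared side by side in a small sketch. The class, the function names, the bait set strings and the prefix match are all assumptions made for illustration; the thread only says that the decision could be based on the GMSmyeloid bait set.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumption: accredited bait sets can be recognised by a name prefix.
ACCREDITED_BAIT_SET_PREFIXES: Tuple[str, ...] = ("gmsmyeloid",)


@dataclass
class SampleInfo:
    is_accredited: bool      # stands in for application.is_accredited
    bait_set: Optional[str]  # stands in for the bait_set value of the sample


def accredited_by_application(samples: List[SampleInfo]) -> bool:
    """Current MIP_DNA-style rule: every sample's application must be accredited."""
    return bool(samples) and all(sample.is_accredited for sample in samples)


def accredited_by_bait_set(samples: List[SampleInfo]) -> bool:
    """Proposed rule: every sample must use an accredited bait set."""
    return bool(samples) and all(
        sample.bait_set is not None
        and sample.bait_set.lower().startswith(ACCREDITED_BAIT_SET_PREFIXES)
        for sample in samples
    )


# Made-up case: both samples share an accredited app tag, only one uses the myeloid panel.
case_samples = [
    SampleInfo(is_accredited=True, bait_set="gmsmyeloid_example"),
    SampleInfo(is_accredited=True, bait_set="other_panel_example"),
]
print(accredited_by_application(case_samples), accredited_by_bait_set(case_samples))
```

The example shows the situation raised in the question above: when the app tag is shared with non-validated panels, only the bait-set rule flags the report as not accredited.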
@henrikstranneheim since the template is taken from MIP generating the report, maybe we shouldn't change many of the parameters?
I can of course go through the new parameters added for Balsamic but not used in MIP, like e.g. fold-80
Most of them are actually the same ones used for MIP (they also share the same source). My question was more about the validation part. Since the dataflow will now be done with pydantic models, I think there is no need to use the --force-report flag. Therefore, we can define the minimum parameters necessary to generate the report (the required ones).
Regarding accreditation - I think Vadym's suggestion sounds good.
Regarding optional and required - It is by intent that you have to use the --force flag, since a lot of reports were being generated with missing values; no one noticed and the customers did not complain. Now the automation will fail and say what is missing, so that a manual override (if appropriate) can be performed. So I would like to keep the settings as they were for generating the MIP report, and I assume that the same assumptions hold for Balsamic. @ivadym We could try to go through them together.
The --force flag is needed when there is no way of getting hold of the data, for instance with legacy data not found in the current LIMS.
Here is my suggestion for which parameters should be changed from optional to required:
Data analysis attributes:
optional: pipeline_version: pipeline version; source: statusDB/analysis/pipeline_version
optional: type: analysis type carried out; BALSAMIC specific; source: pipeline workflow
Sample attributes:
optional: tumour: whether the sample is a tumour or normal one; BALSAMIC specific; source: StatusDB/sample/is_tumour
Sample metadata attributes:
optional: capture_kit: panel bed used for the analysis
optional: capture_kit_version: panel bed version; BALSAMIC specific
optional: million_read_pairs: number of million read pairs obtained; source
optional: duplicates: fraction of mapped sequence that is marked as duplicate
optional: target_bases_250X: percent of targeted bases that are covered to 250X coverage or more; BALSAMIC specific
optional: target_bases_500X: percent of targeted bases that are covered to 500X coverage or more; BALSAMIC specific
optional: median_coverage: median coverage in bases
optional: mean_insert_size: mean insert size of the distribution; BALSAMIC specific
optional: fold_80: fold 80 base penalty; BALSAMIC specific
@ivadym and I discussed a few WGS parameters for which we need to clarify where the values for the delivery report should be taken from. @henrikstranneheim and @vwirta, please provide input on the topics below:
For TGS, the duplication value is taken from PERCENT_DUPLICATION in picard DuplicationMetrics. For WGS, DuplicationMetrics is not included in the rules; instead, pct_duplication from fastp and percent_duplicates from FastQC can be extracted.
However, the calculated duplication percentage differs between fastp and FastQC for WGS:
grep -i 'percent_duplicates\|pct_duplication' /home/proj/production/cancer/cases/richgrouse/analysis/qc/multiqc_data/*
And the calculated duplication percentage differs between DuplicationMetrics, fastp and FastQC for TGA:
grep -i 'percent_duplicates\|pct_duplication\|PERCENT_DUPLICATION' /home/proj/production/cancer/cases/deargrouse/analysis/qc/multiqc_data/*
So the question is, which duplication value should we choose for WGS? Looking at several TGA samples, percent_duplicates from FastQC is always closer to the value provided by DuplicationMetrics than pct_duplication from fastp is.
My suggestion is to use FastQC for WGS. Any thoughts?
We chose 100X and 250X for TGA, but this is far too high for a WGS normal sample. Should we include a third value, PCT_30X, for the normal WGS sample, and let PCT_100X represent the tumor WGS sample, which is usually sequenced deeper?
Balsamic doesn't predict gender or perform any gender checks. We decided to use the gender from status-db, provided by the customer when submitting the order.
Generated HTML report for a TN-WGS BALSAMIC case:
Feel free to give some input about the structure and content of the file (the source of some fields has not been decided yet, as @keyvanelhami pointed out).
@ivadym beautiful work! Me and Patrik are back next week from vacation - let’s try to nail the last details of this early next week so you can merge and deploy this
Looks great!
A few comments from me
1) The version information at the top is great! Could it be strengthened even further by replacing ": 1" with “version: 1”?
2) Customer attributes - would be great to see an example of what these fields look like for a real case
3) In column headers, only capitalize the first word. For example “Bioinformatisk analys”
4) Case table, column ‘Analystyp’. Is this hard coded in BALSAMIC? If it is, it would be great to see all the different alternatives to make sure they are understandable. If we have flexibility regarding these, I’d prefer to have them in Swedish, but we should not do a lot of extra work to achieve it. It is enough that they are easily understandable.
5) Where do I end up if I click the ‘Variant-calling filters’? Is it clear here what callers have been applied for this analysis type?
6) Coverage table. Are the ‘Täckningsgrad’ columns always fixed to 250x and 500x? For WGS applications we would need something lower. Here I’d suggest adding 15x to allow estimating sequencing performance for the normal sample, and 60x for the tumour sample. (Background: at 60x we can detect variants at 5% VAF if using a criterion of 3 supporting reads)
7) KE, your question regarding duplication rate… I have no preference for source. Let’s use what seems to be most accurate. (The question of why these values differ is of course interesting to try to answer…)
@vwirta I would use 30X for normal instead:
For the above WGS case richgrouse, the normal sample (app tag WGSPCFC030) got "PCT_15X": 0.972587 and "PCT_30X": 0.726772 with 400 M r-p. We guarantee 22.5X mean coverage for that app, right?
From WGSPCFC030
Whole-genome sequencing, PE 2x150, 30x coverage using NovaSeq 6000, PCR-free library preparation. 26x coverage guaranteed (OMIM panel) for samples fulfilling criteria. The reference genome is hg19. For cases where analysis in MIP has been requested, single nucleotide variants and short insertions/deletions will be identified using GATK haplotype caller and variants ranked according to expected disease-causing potential. Structural variants are called using Manta, Tiddit and CNVnator. Short tandem repeat expansions are called using ExpansionHunter. Top ranked variants for selected gene panels are uploaded to Scout for further evaluation. Cases analysed in MIP are accredited.
So for rare disease cases we guarantee 26 x coverage (OMIM panel).
You've done a great job with this report!
A thought - most of it is in Swedish, except the text in the sections Teknisk begränsning av analysen and Begränsningar av analysen.
@vwirta I'll add the proposed changes to the report and here are some answers from my part:
2) Customer attributes - would be great to see an example of what these fields look like for a real case
Actually, I think this was a real case. However, kundinformation only includes the customer's invoice address (following MIP's report structure), but this could be extended with some additional fields (customer name, ...?).
4) Case table, column ‘Analystyp’. Is this hard coded in BALSAMIC? If it is, it would be great to see all the different alternatives to make sure they are understandable. If we have flexibility regarding these, I’d prefer to have them in Swedish, but we should not do a lot of extra work to achieve it. It is enough that they are easily understandable.
These are the different types: ["tumor_wgs", "tumor_normal_wgs", "tumor_panel", "tumor_normal_panel"]. And I think having them in Swedish is straightforward if someone can help me with the translation.
5) Where do I end up if I click the ‘Variant-calling filters’? Is it clear here what callers have been applied for this analysis type?
It will redirect you here: https://balsamic.readthedocs.io/en/latest/balsamic_filters.html (maybe an explicit url will be more useful?). BALSAMIC does not provide you with the specific callers, although they can be hard coded, since currently all of them are executed depending on the analysis (I will exclude some depending on whether the analysis is wgs or panel).
Hi Vadym,
Sorry for the late reply to this.
Regarding 2) As long as we can have the customer name included here, I am fine.
Regarding 4) I would like to propose these translations:
"tumor_wgs": Tumör-endast (helgenomsekvensering)
"tumor_normal_wgs": Tumör/normal (helgenomsekvensering)
"tumor_panel": Tumör-endast (panelsekvensering)
"tumor_normal_panel": Tumör/normal (panelsekvensering)
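One simple way to apply the translations proposed above is a lookup table from the BALSAMIC analysis type to the Swedish label. This is a sketch only; the constant and function names are made up, and falling back to the raw identifier for unknown types is an assumption rather than agreed behaviour.

```python
# Swedish labels as proposed in this thread, keyed by BALSAMIC analysis type.
ANALYSIS_TYPE_LABELS = {
    "tumor_wgs": "Tumör-endast (helgenomsekvensering)",
    "tumor_normal_wgs": "Tumör/normal (helgenomsekvensering)",
    "tumor_panel": "Tumör-endast (panelsekvensering)",
    "tumor_normal_panel": "Tumör/normal (panelsekvensering)",
}


def analysis_type_label(analysis_type: str) -> str:
    """Return the Swedish label, or the raw identifier if no translation exists."""
    return ANALYSIS_TYPE_LABELS.get(analysis_type, analysis_type)


label = analysis_type_label("tumor_normal_wgs")
print(label)
```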
Regarding 5) As this is a pdf that might get printed out and scanned as an image when added to medical records, I’d prefer to have the explicit URL in order not to lose any information. I think we need to be able to convey somehow which callers we have used in the specific analysis. Not sure how to do it right now. If this is difficult, we can return to this later.
regards, VW
Hi Vadym,
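You can use the following text to describe those QC parameters that are missing from the previous report template:
Läspar: Antal sekvenseringsläsningar i miljoner läspar. (Read pairs: number of sequencing reads, in millions of read pairs.)
Duplikat: sekvenseringsläsningar som är i duplikat och därmed ej unika sekvenser. Hög mängd duplikat kan tyda på dålig komplexitet av sekvenserad bibliotek. (Duplicates: sequencing reads that are duplicated and thus not unique sequences. A high amount of duplicates can indicate poor complexity of the sequenced library.)
Tillämpade filter för variantanrop: (Applied variant-calling filters:)
Looks good @keyvanelhami. One addition to Duplikat:
Duplikat: sekvenseringsläsningar som är i duplikat och därmed ej unika sekvenser. Hög mängd duplikat kan tyda på dålig komplexitet av sekvenserad bibliotek eller djup sekvensering. (Duplicates: sequencing reads that are duplicated and thus not unique sequences. A high amount of duplicates can indicate poor complexity of the sequenced library, or deep sequencing.)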
Hi all!
It seems this is almost ready? To complete the Swedac deviation, I would like to have an empty report with a Swedac logo to submit early next week (Monday if possible).
If we need to fill out an application it is the one for target sequencing and myeloid panel, PANKTTR030.
@ivadym, @keyvanelhami
Hej!
@keyvanelhami, @vwirta – some minor updates regarding the report for WGS cases:
@AnnaLeinfelt - here is the report including a recent myeloid case. Tell me if you need it in a specific format (html, pdf, ...) or if something has to be changed.
Empty report: balsamic_delivery_report.pdf
@ivadym Fantastic work! 🚀
I agree, looks fantastic! Great work!
Very nice Vadym.
Just a minor thing: Could you write the descriptions of the QC values (Förklaring) in the same order as in the table? I.e. Läspar, Mediantäckning, Täckningsgrad 250X, etc...
Thanks @keyvanelhami. I updated the previous myeloid report with these changes.
Thanks @ivadym, it looks nice! If not too much trouble I would like a report in pdf format with app tag PANKTTR030. It can be empty or with customer samples but a swedac logo is needed.
Is this issue resolved and can be closed?
I think so, further changes to the delivery report should be addressed in new issues. Closing!
Description
@keyvanelhami and I discussed what to include in the delivery report for BALSAMIC. These are our proposed changes from the delivery report for mip. (The text in the fields below also has to be updated.)
(Fold-80, median target coverage and fragment length are available in the Coverage and QC report that is generated for TGA cases and available in Scout.)