Open adamcweiner opened 4 years ago
Here's a brief summary of how to use the new flags on single cell data:
1) create a telbam file for each bam file in the library (nothing new here)
2) merge telbams from library into one pseudobulk
samtools merge -f {bulk} -b {input}
where {input}
is a txt file containing one telbam.bam file path per line and {bulk}
is the pseudobulk bam file you wish to create
3) call telomerecat telbam2length to get length estimates
telomerecat telbam2length -cnt -e -i {input} -b {bulk} --output {csv_output}
where {input}
is a header-less CSV file containing telbam path, coverage depth, and number of telomeres (in that order) for each cell of interest, {bulk}
if the pseudobulk bam file you created in step 2, and {csv_output}
is the output file written by telomerecat containing the telomere lengths & other statistics.
If you wish to have telomerecat return the error profiles, here's how you use the new flags for that...
1) When running telomerecat telbam2length
on just one telbam, you can save the error profile (binary array) to a specified {err_path}
with the -ep
argument
telomerecat telbam2length {input} -ep {err_path} --output {csv_output}
2) When running telomerecat telbam2length
on multiple (single cell) telbams, you must also specify the boolean -el, --error_list
flag in order to note that the {err_path}
you specified is a txt file containing a paths to saved error profiles instead of an error profile itself
telomerecat telbam2length -e -i {input} -ep {err_path} -el --output {csv_output}
In this case, if your {input}
consisted of
/path/to/telbams/A_telbam.bam,
/path/to/telbams/B_telbam.bam,
/path/to/telbams/C_telbam.bam
and your {err_path}
was specified to be /path/to/error_profiles/error_profile_list.txt
, then said file will look like
/path/to/error_profiles/A_error_profile.txt,
/path/to/error_profiles/B_error_profile.txt,
/path/to/error_profiles/C_error_profile.txt
with each line corresponding to a saved error profile matrix
Thanks for the comments. I'll go ahead and lint according to the PEP8 style guide later today. As for the two unit tests you suggested, I haven't checked exact values for the same sample in existing vs new versions of telomerecat but I did notice that the mean absolute error of estimated vs real length remained the same in a simulated dataset when I switched from to the new version in this pull request. This plus my intuition tells me that we're probably fine but I'll write an explicit test so we can be confident.
So I've been testing on some WGS samples and realized that the --seed_randomness
flag doesn't seem to work for either version (old or new) of telomerecat. When I include the flag the length estimation never commences and if I run the same commands without the --seed_randomness
flag things work just fine. Does this happen on your end too? I haven't been able to pin down the exact bug yet but if you guys get the same error as well I'll keep looking.
For example, with this command telomerecat sits idle for an infinite amount of time before I decide to interrupt:
(env) (base) bash-4.2$ telomerecat telbam2length analysis/simulate_wgs//simulated_cell_1/simulated_cell_1_0_telbam.bam --seed_randomness -v 2 --output analysis/simulate_wgs/simulated_cell_1/simulated_cell_1_0_new_output.csv
telomerecat telbam2length has started. Start Time: 2020-11-16 09:43:40
----------------------------------------------------------------------
Collecting meta-data for all samples | 2020-11-16 09:43:40
- simulated_cell_1_0.bam | 2020-11-16 09:43:40
Commencing length estimation | 2020-11-16 09:44:19
- simulated_cell_1_0.bam | 2020-11-16 09:44:19
[ERROR] telomerecat interrupted by user
[ERROR] telomerecat interrupted by user
[ERROR] telomerecat interrupted by user
[ERROR] telomerecat interrupted by user
[ERROR] telomerecat interrupted by user
[Status] telomerecat is quitting gracefully
^C(env) (base) bash-4.2$ ^C
Yet the same command with --seed_randomness
removed comes up with a length estimate within 2 minutes:
(env) (base) bash-4.2$ telomerecat telbam2length analysis/simulate_wgs//simulated_cell_1/simulated_cell_1_0_telbam.bam -v 2 --output analysis/simulate_wgs/simulated_cell_1/simulated_cell_1_0_new_output.csv
telomerecat telbam2length has started. Start Time: 2020-11-16 09:49:33
----------------------------------------------------------------------
Collecting meta-data for all samples | 2020-11-16 09:49:33
- simulated_cell_1_0.bam | 2020-11-16 09:49:33
Commencing length estimation | 2020-11-16 09:50:10
- simulated_cell_1_0.bam | 2020-11-16 09:50:10
Length estimation results written to the following file:
./simulated_cell_1_0_new_output.csv
----------------------------------------------------------------------
telomerecat telbam2length has finished. End Time: 2020-11-16 09:50:21
This has passed a bunch of testing on my end. The only large
TODO
item left is to figure out how to model uncertainty when using the--cov_ntel
flag.