cancerit / telomerecat

Telomerecat: The telomere computational analysis tool
GNU General Public License v3.0
20 stars 5 forks source link

Adam develop #17

Open adamcweiner opened 4 years ago

adamcweiner commented 4 years ago

This has passed a bunch of testing on my end. The only large TODO item left is to figure out how to model uncertainty when using the --cov_ntel flag.

adamcweiner commented 4 years ago

Here's a brief summary of how to use the new flags on single cell data:

1) create a telbam file for each bam file in the library (nothing new here)

2) merge telbams from library into one pseudobulk samtools merge -f {bulk} -b {input} where {input} is a txt file containing one telbam.bam file path per line and {bulk} is the pseudobulk bam file you wish to create

3) call telomerecat telbam2length to get length estimates telomerecat telbam2length -cnt -e -i {input} -b {bulk} --output {csv_output} where {input} is a header-less CSV file containing telbam path, coverage depth, and number of telomeres (in that order) for each cell of interest, {bulk} if the pseudobulk bam file you created in step 2, and {csv_output} is the output file written by telomerecat containing the telomere lengths & other statistics.

adamcweiner commented 4 years ago

If you wish to have telomerecat return the error profiles, here's how you use the new flags for that...

1) When running telomerecat telbam2length on just one telbam, you can save the error profile (binary array) to a specified {err_path} with the -ep argument telomerecat telbam2length {input} -ep {err_path} --output {csv_output}

2) When running telomerecat telbam2length on multiple (single cell) telbams, you must also specify the boolean -el, --error_list flag in order to note that the {err_path} you specified is a txt file containing a paths to saved error profiles instead of an error profile itself telomerecat telbam2length -e -i {input} -ep {err_path} -el --output {csv_output} In this case, if your {input} consisted of

/path/to/telbams/A_telbam.bam,
/path/to/telbams/B_telbam.bam,
/path/to/telbams/C_telbam.bam

and your {err_path} was specified to be /path/to/error_profiles/error_profile_list.txt, then said file will look like

/path/to/error_profiles/A_error_profile.txt,
/path/to/error_profiles/B_error_profile.txt,
/path/to/error_profiles/C_error_profile.txt

with each line corresponding to a saved error profile matrix

adamcweiner commented 4 years ago

Thanks for the comments. I'll go ahead and lint according to the PEP8 style guide later today. As for the two unit tests you suggested, I haven't checked exact values for the same sample in existing vs new versions of telomerecat but I did notice that the mean absolute error of estimated vs real length remained the same in a simulated dataset when I switched from to the new version in this pull request. This plus my intuition tells me that we're probably fine but I'll write an explicit test so we can be confident.

adamcweiner commented 4 years ago

So I've been testing on some WGS samples and realized that the --seed_randomness flag doesn't seem to work for either version (old or new) of telomerecat. When I include the flag the length estimation never commences and if I run the same commands without the --seed_randomness flag things work just fine. Does this happen on your end too? I haven't been able to pin down the exact bug yet but if you guys get the same error as well I'll keep looking.

adamcweiner commented 4 years ago

For example, with this command telomerecat sits idle for an infinite amount of time before I decide to interrupt:

(env) (base) bash-4.2$ telomerecat telbam2length analysis/simulate_wgs//simulated_cell_1/simulated_cell_1_0_telbam.bam --seed_randomness -v 2 --output analysis/simulate_wgs/simulated_cell_1/simulated_cell_1_0_new_output.csv
telomerecat telbam2length has started. Start Time: 2020-11-16 09:43:40
----------------------------------------------------------------------

 Collecting meta-data for all samples | 2020-11-16 09:43:40
    - simulated_cell_1_0.bam | 2020-11-16 09:43:40

 Commencing length estimation | 2020-11-16 09:44:19
    - simulated_cell_1_0.bam | 2020-11-16 09:44:19
[ERROR] telomerecat interrupted by user
[ERROR] telomerecat interrupted by user
[ERROR] telomerecat interrupted by user
[ERROR] telomerecat interrupted by user
[ERROR] telomerecat interrupted by user

[Status] telomerecat is quitting gracefully

^C(env) (base) bash-4.2$ ^C

Yet the same command with --seed_randomness removed comes up with a length estimate within 2 minutes:

(env) (base) bash-4.2$ telomerecat telbam2length analysis/simulate_wgs//simulated_cell_1/simulated_cell_1_0_telbam.bam -v 2 --output analysis/simulate_wgs/simulated_cell_1/simulated_cell_1_0_new_output.csv
telomerecat telbam2length has started. Start Time: 2020-11-16 09:49:33
----------------------------------------------------------------------

 Collecting meta-data for all samples | 2020-11-16 09:49:33
    - simulated_cell_1_0.bam | 2020-11-16 09:49:33

 Commencing length estimation | 2020-11-16 09:50:10
    - simulated_cell_1_0.bam | 2020-11-16 09:50:10

 Length estimation results written to the following file:
    ./simulated_cell_1_0_new_output.csv

----------------------------------------------------------------------
telomerecat telbam2length has finished.  End Time: 2020-11-16 09:50:21