Phylo csf++ bed included

pskvins commented 3 years ago

Hello Doctor Christopher Pockrandt, I'm who asked about issue that happens with Drosophila melanogaster chromosome 2L(https://github.com/cpockrandt/PhyloCSFpp/issues/5)

I'm asking pull request of code, which is for generating bed file that gives the information about coding region. It makes bed file when --output-phylo is true in build-tracks.

I used the same command as I showed you in the issue, but I used the whole genome length as you told me to.

I compared the result file with the file 'PhyloCSF+2.bed.gz' that I added below. PhyloCSF+2.bed.gz As a result, among 13334 regions in PhyloCSF+2.bed.gz, 13108 regions are perfectly matched, 37 regions are found only in PhyloCSF+2.bed.gz, 202 regions are found only in result file, and 325 regions in PhyloCSF+2.bed.gz are more than 95% matched (but not perfectly matched) with the result file.

Please take a look on the code that I added, and let me know if I have to refine any of my code.

Thank you.

cpockrandt commented 3 years ago

Thank you for your work and the pull-request @pskvins

I'm still testing your feature and will come back when it finished. There are a few TODOs in your code. Are they still relevant or can they be deleted?

I have two major suggestions:

can you add a new flag --output-regions BOOL and turn it off by default (like --output-phylo BOOL) so users can turn it on and off as they like?
you deleted a test file chr22_25_28_each_30k_bases.maf.gz as well as added some whitespace changes that were probably introduced by your editor. Can you undo that so that the PR is cleaner and doesn't change anything that does not affect your code? Then the tests on travis should also pass.

Christopher

pskvins commented 3 years ago

Thank you for reviewing my code, Christopher.

I added --output-regions BOOL and make it to show the error message when people try to run --output-regions when phylo_smooth is false. I checked if the added function work well in three different conditions: where I omitted --output-regions, where I added --output-regions 0, and for --output-regions 1.

I also added the explanation about --output-regions and checked if --help shows it well, recovered a test file chr22_25_28_each_30k+bases.maf.gz, and undid all the whitespace that I could find.

Please take a look on the modification and let me know about the things I have to correct more.

Thank you.

Sukhwan

pskvins commented 3 years ago

I made --output-regions to work without --output-phylo being true, and made changes you requested except putting result in ./test/expected_results/build-tracks because I wanted you to check if the result bed file looks valid.

I did a same test using the chr2L of drosophila melanogaster as I told you at the beginning(https://github.com/cpockrandt/PhyloCSFpp/issues/5). As a result, I got PhyloCSF+2_C++.bed, and compared with PhyloCSF+2_chr2L.bed file which I only captured informations of chr2L from original bed file.

PhyloCSF+2_chr2L.bed.gz

PhyloCSF+2_C++.bed.gz

Among 13334 potential CDSs from original file, 13108 CDSs (98.3%) are found to be exactly same with the resulted bed file, and 37 CDSs (0.28%) are found only (without any overlap) in original file. Resulted bed file has 13690 potential CDSs, which has 202 CDSs that are found only in resulted file, and 379 CDSs are found to be the fragments (overlap 100%, but not exactly same) of CDSs from original file.

I attached a table below that shows what I've explained above.

comparing original and generated PhyloCSF2+.bed.pptx

Please take a look and let me know if I can add result bed files to expected_results/build-tracks

pskvins commented 3 years ago

I've compared PhyloCSF wig files and raw wig files, and find out that while raw scores are almost exactly same in every place, whereas scores in wig files are different largely only in -1 and -3 frames, different moderately in -2 frame, and almost same in every place for positive strands.

I find out that differences in the scores are mostly generated from the codon only appears in the either side of wig files, and also from codons that are near the codon only appears in the either side of wig files, and thought that gap would have caused forward-backward algorithm to compute the score near the gap different from the original file.

I counted all the gaps in both wig files, and it turns out that for positive strands, there are 2 or no gaps between two files, whereas -1 and -3 frames have about 12,000 gaps on each file and -2 frame has about 230 gaps on each file out of about 7 million codons in each wig file.

I've counted the number of potential coding regions from bed files that are not exactly same with each other, and for positive strands and -2 frame, about 200 are counted from original bed file and about 520 regions are counted from the file I generated. But in -1 and -3 frames, about 415 regions are counted from original file and 769 and 804 regions are counted from -1 and -3 frame respectively. In here, the difference between two cases might have been caused by the score difference in frame -1 and -3. When I compared the scores of wig files, there was about 1480 codons that the signs of scores are different in positive strand and frame -2, when -1 and -3 frames have about 2330 codons of different signs of scores. Also, among this codons in positive strand and frame -2, score that I generated tend to show more high scores (about 60%), when frame -1 and -3 showed equal percentage.

Lastly, while I run a program, I found one issue about frame. I generated datas from dm6 msa, and I found the negative frame number is off by 1 from the negative frame of original wig file (for example, frame -1 of wig file I generated was same as frame -2 of original PhyloCSF data)

Please take a look and let me know if you need more information or if I need to correct my code.

Thank you. Sukhwan Park

cpockrandt commented 3 years ago

Thank you for looking so much into it! I discussed it with Martin and we agree that your code looks good and the discrepancies are most likely due to missing values in the smoothing. I will look into it.

I have just one tiny change before we merge it: Can you name the bed files the same way as the official PhyloCSF bed files? i.e., "PhyloCSF+1Regions.bed" instead of "PhyloCSF+1.bed". Thank you! :)

pskvins commented 3 years ago

I changed the name of the bed files and made the function of calculating color code to get rid of the redundancies.

I also found the cause of the track scores from negative frames. It was because of the starting position of bls_pos from for (size_t xx = 0, bls_pos = aln.skip_bases; xx < lpr_per_codon[thread_id].size(); ++xx, bls_pos += 3) For negative strand, it seems like it has to take bls_pos = aln.length() % 3 rather than bls_pos = aln.skip_bases so I splitted up the code for the case of positive strand and negative strand. I've already tried to compare the resulted track files with the original track files, and almost every score look same. I attached a graph below for comparing score of PhyloCSF and PhyloCSF++ after I changed a score.

compared_new_wig-3 compared_new_wig-2 compared_new_wig-1 compared_new_wig+3 compared_new_wig+2 compared_new_wig+1

Lastly, there was also a difference between PhyloCSF and PhyloCSF++ about frames in negative strand. It seems like the way of deciding frame on the negative strand is different. Due to this, it seems like it is little bit complex to compare scores from PhyloCSF++ with scores from PhyloCSF, especially when the file contains the informations about multiple chromosomes. I heard that you already know about this, but I wanted to tell about it just in case.

If there is anything to change, please let me know. Thank you!

cpockrandt commented 3 years ago

It seems travis did not run for your latest commit, so I will close and reopen your PR, which will hopefully trigger a build.

Edit: there was a merge conflict which led to travis not running.

cpockrandt commented 3 years ago

Thank you @pskvins, I did the following two things:

I rebased your commits on master (there was a conflict because of changes on master in the meantime, and your PR is going to go onto the master branch).
I made some changes in the commit "Code improvements"

This includes:

removal of some code duplication
do not write bed files when the user did not set the flag (instead of deleting them afterwards)
I made a lot of minor changes to reduce the "diff" to master, so it is easier to understand the changes in this commit. When I merge the commit, I will squash all the commits together
and some other changes that I believe make the code easier to read or maintain

Before merging, I will check the results myself (might take 1-2 days), because I noticed that also the -2 strand and the raw files changed, and I want to make sure that I did not mess up your code with my changes. I'm happy that you found and fixed the bug!

cpockrandt commented 3 years ago

LGTM!

cpockrandt / PhyloCSFpp

Phylo csf++ bed included #6