Window positions incorrect >~1000 for mfasta

dstern commented 5 years ago

Thanks for the handy script. Running with multi-fasta file. For every chromosome, window position looks fine until ~>1000, then reports same window for all rows and the numbers don't really make sense (latter number is smaller than former number). Each file seems to "flip" at a different point. Maybe has something to do with keeping track of position across multiple files? e.g.

chr1 1 100 0.333 chr1 101 200 0.327 chr1 201 300 0.329 chr1 301 400 0.334 chr1 401 500 0.331 chr1 501 600 0.33 chr1 601 700 0.339 chr1 701 800 0.334 chr1 801 900 0.332 chr1 901 1000 0.323 chr1 1001 1100 0.329 chr1 1101 1112 0.326 chr1 1113 1112 0.29 chr1 1113 1112 0.279 chr1 1113 1112 0.278 chr1 1113 1112 0.27 chr1 1113 1112 0.256 chr1 1113 1112 0.256 chr1 1113 1112 0.269 chr1 1113 1112 0.274 chr1 1113 1112 0.263

DamienFr commented 5 years ago

Hello,

Thanks for using the tool and for reporting the bug. The multifasta option was recently added at the request of a user (see #issue1).

Would you be able to send me a failing multifasta and the exact command line that you used ?

From a quick check it seems that the size of your fasta is calculated wrong at the beginning, so it might contain special characters ?

At the moment i don't have time to fully check, but i'll take a deep look on it soon ;)

dstern commented 5 years ago

While preparing a file for you, I think I got closer to the problem. If I examine a file with a single fasta, then the output is fine. chr1 1 100 0.277 chr1 101 200 0.276 chr1 201 300 0.285 chr1 301 400 0.289 chr1 401 500 0.289 chr1 501 600 0.302 chr1 601 700 0.303 chr1 701 800 0.3 chr1 801 900 0.308 chr1 901 1000 0.307 chr1 1001 1100 0.29 chr1 1101 1122 0.288

If this sequence is embedded in a mfasta, then rows are added after the last correct position (1122 in this case), presumably reflecting remembrance of things past. I suspect some variable is not cleared after each record is processed.

chr1 1 100 0.28 chr1 101 200 0.19 chr1 201 300 0.23 chr1 301 400 0.09 chr1 401 500 0.37 chr1 501 600 0.29 chr1 601 700 0.44 chr1 701 800 0.39 chr1 801 900 0.34 chr1 901 1000 0.41 chr1 1001 1100 0.21 chr1 1101 1122 0.19 chr1 1123 1122 0.3 chr1 1123 1122 0.42 chr1 1123 1122 0.32 chr1 1123 1122 0.35 chr1 1123 1122 0.36 chr1 1123 1122 0.46 chr1 1123 1122 0.32 chr1 1123 1122 0.35

This is supported by the fact that when a longer record precedes a shorter record, the output for the shorter record shows the erroneous lines. This is not true is a shorter record precedes a longer record. I've attached a file that replicates the bug two.fasta.zip

.

DamienFr commented 5 years ago

You got my attention. I had a look at the script but could not run anything nor test with your data nor test my own modification because i'm on windows (linux is at work).

I still updated the code (UNTESTED version) with a solution :

The bug was that values of two hashes weren't deleted between different sequences of multifasta files. This cause no harm if sequences are in increasing order of length because each value (i.e. sequence position) of every hash is reused for each sequence, BUT as GCskew and % are kept for each positions, if second sequence is shorter than sequences treated earlier, it still has values calculated previously.

I corrected it.

Could you test it ? i'll be out of the office for some more days and will not be able to test it myself.

dstern commented 5 years ago

Works! Many thanks for your rapid attention to this little bug. This script was just what I needed and I appreciate how quickly you addressed the issue. All the best, David

DamienFr commented 5 years ago

You are very welcome ;)

DamienFr / GC_content_in_sliding_window

Window positions incorrect >~1000 for mfasta #4