ay-lab / fithic

Fit-Hi-C is a tool for assigning statistical confidence estimates to chromosomal contact maps produced by genome-wide genome architecture assays such as Hi-C.
MIT License

TypeError: can only concatenate str (not "int") to str #42

Open FatihSarigol opened 3 years ago

FatihSarigol commented 3 years ago

Hello, my test run (fithic/tests/run_tests-git.sh) finished successfully, but when I ran version 2.0.7 on my own files with this command:

python3 fithic.py -f fithic.fragmentMappability.gz -i fithic.interactionCounts.gz -o FitHicAmphioxus -t fithic.biases.gz -r 150000

I received this error:

Reading the contact counts file to generate bins...
Interactions file read. Time took 26.23392629623413
Traceback (most recent call last):
  File "/home/user/sarigoel/Programs/FITHIC/fithic/fithic/fithic.py", line 1324, in <module>
    main()
  File "/home/user/sarigoel/Programs/FITHIC/fithic/fithic/fithic.py", line 323, in main
    (binStats,noOfFrags, maxPossibleGenomicDist, possibleIntraInRangeCount, possibleInterAllCount, interChrProb, baselineIntraChrProb) = generate_FragPairs(observedInterAllCount, observedInterAllSum, binStats, fragsFile, resolution)
  File "/home/user/sarigoel/Programs/FITHIC/fithic/fithic/fithic.py", line 600, in generate_FragPairs
    print("ERROR - the chromosome " + ch + " has " + len(allFragsDic[ch]) + " valid fragments/bins and should be removed from the input fragment information !!! ")
TypeError: can only concatenate str (not "int") to str

Here is what my input files look like:

[sarigoel@myotis AMPHIOXUS]$ zcat fithic.biases.gz | head -n2
Sc7u5tJ_517 75000 1.970547623956338
Sc7u5tJ_517 225000 0.40157523166875075
[sarigoel@myotis AMPHIOXUS]$ zcat fithic.fragmentMappability.gz | head -n2
Sc7u5tJ_517 0 75000 17395 1
Sc7u5tJ_517 150000 225000 2437 1
[sarigoel@myotis AMPHIOXUS]$ zcat fithic.interactionCounts.gz | head -n2
Sc7u5tJ_517 75000 Sc7u5tJ_517 75000 1700
Sc7u5tJ_517 75000 Sc7u5tJ_517 225000 5

I used an old HiC-Pro (version 2.10.0) to generate my initial data and converted it with this command:

python3 HiCPro2FitHiC.py -i Sample1_150000.matrix -b Sample1_150000_abs.bed -s Sample1_150000_iced.matrix.biases -o . -r 150000

These files had these lengths:

3776446 Sample1_150000.matrix
3769 Sample1_150000_abs.bed
3769 Sample1_150000_iced.matrix.biases

and their first two lines were as follows:

==> Sample1_150000.matrix <==
1 1 1700
1 2 5

==> Sample1_150000_abs.bed <==
Sc7u5tJ_517 0 150000 1
Sc7u5tJ_517 150000 246623 2

==> Sample1_150000_iced.matrix.biases <==
1.917118534333063673e+00
3.906869898508548156e-01

The Sample1_150000_iced.matrix.biases file also had nan values, which I guess were converted to -1.

Following the conversion the files kept their original lengths:

[sarigoel@myotis AMPHIOXUS]$ zcat fithic.interactionCounts.gz | wc -l
3776446
[sarigoel@myotis AMPHIOXUS]$ zcat fithic.fragmentMappability.gz | wc -l
3769
[sarigoel@myotis AMPHIOXUS]$ zcat fithic.biases.gz | wc -l
3769

As for the chromosome names, all start with Sc7u5tJ_ and there is no other special character than an underscore, each followed by a scaffold number.

The log file had these lines:

###########
Interactions file read successfully
Observed, Intra-chr in range: pairs= 275495 totalCount= 6213510
Observed, Intra-chr all: pairs= 275495 totalCount= 6213510
Observed, Inter-chr all: pairs= 3500951 totalCount= 7397792
Range of observed genomic distances [0 35250000]

Making equal occupancy bins
Observed intra-chr read counts in range 6213510
Desired number of contacts per bin 62135.1, Number of bins 100
Equal occupancy bins generated

Looping through all possible fragment pairs in-range_
############

Can you think of a reason that may have caused the error? Thank you!

ay-lab commented 3 years ago

I believe the error comes from "ch" being an integer for at least one chr in the line below. Can you check all your contigs/chrs to make sure none of them are somehow integers? (I see you already say so, but please double-check.)

print("ERROR - the chromosome " + ch + " has " + len(allFragsDic[ch]) + " valid fragments/bins and should be removed from the input fragment information !!! ")

The other possibility is a Python version difference in how len(allFragsDic[ch]), which is an integer (as it should be), gets appended to the overall string. Overall, I believe that if you filter out the chrs/contigs with no valid bins (all out of the bias value range), the code should run.

FatihSarigol commented 3 years ago

Thank you for your reply. I checked again, this time using grep to search for any scaffold name that does not contain Sc7u5tJ_, and found none, so all of them do start with this prefix followed by a number. I removed the bins shorter than my bin size of 150000 bases from the bed file (is that what you mean by filtering out contigs with no valid bins/out of bias value range?), but then HiCPro2FitHiC.py gave a key error with that bed file. I suppose I also need to remove them from the biases file, or all their interactions from the matrix as well? Or do you mean the ones with a nan value in the biases file? Thanks!

aryakaul commented 3 years ago

Can you confirm you're using Python3? If you are and are still getting this, then I would recommend attempting to remove the Sc7u5tJ_ from the files (sed 's/Sc7u5tJ_//g') and see if that resolves it.

FatihSarigol commented 3 years ago

Thank you for your reply again. Yes, I am using Python 3.8, but it is on an HPCC: I use the Python installed on the cluster and installed the dependency packages locally myself. That said, I just ran the test script again in this session and it again ended with "All tests completed successfully. Fit-Hi-C is up and running!". I see that the test script calls Python via the python3 command, which is the same way I am running it. (python2 is also on my path, and the plain python command without a version calls that, since it sits in a default bin folder. So if some line inside the code calls python rather than python3, it would end up calling python2.)

I removed the Sc7u5tJ_ prefix with the command you suggested and the conversion went well, but FitHiC gave the same error on the new files, where the contig names are only numbers. I tried to add 3 to the environment of fithic.py, but that gave another error. So if you believe it may be related to python2 being called somehow, even though the test script runs fine, I can try installing it via conda, I guess. Thanks!

aryakaul commented 3 years ago

After looking at the code, I think the error you're getting is actually a bug in the way we output our error message.
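For readers arriving from the traceback: the message is assembled with `+`, and `len(...)` returns an int, which Python 3 refuses to concatenate to a str. The sketch below reproduces that failure and shows one way the line could be written instead. Here `all_frags` is only a stand-in for `allFragsDic`, the scaffold name is taken from the thread, and the f-string version is an assumption about a possible patch, not the maintainers' actual change.

```python
# Stand-in for allFragsDic: scaffold name -> list of valid fragment midpoints.
# A scaffold with an empty list is exactly the case that triggers the message.
all_frags = {"Sc7u5tJ_1239": []}
ch = "Sc7u5tJ_1239"


def broken_message(ch, frags):
    # Mirrors the line from the traceback: str + int raises TypeError.
    return "ERROR - the chromosome " + ch + " has " + len(frags[ch]) + " valid fragments/bins"


def fixed_message(ch, frags):
    # An f-string formats the int, so no explicit str() call is needed.
    return (
        f"ERROR - the chromosome {ch} has {len(frags[ch])} valid fragments/bins "
        "and should be removed from the input fragment information !!!"
    )


try:
    broken_message(ch, all_frags)
    raised = False
except TypeError as err:
    raised = True
    print("TypeError:", err)

print(fixed_message(ch, all_frags))
```

Either way, the underlying condition is the same: the scaffold named in the message has no valid fragments and still needs to be removed from the input.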

Regardless, this is a check to make sure people don't see the #39 error. I'd go through the scaffolds in your fragments + bias file to make sure you have no scaffold which has no valid fragments. You can also throw a print(ch) right before this line to find out which scaffold is causing the issue.

FatihSarigol commented 3 years ago

Thank you for your help one more time!

Below are the bias values for the chr names it printed (it stopped after the last one) when I ran it after adding print(ch) at line 600:

517 75000 1.970547623956338
517 225000 0.40157523166875075
1522 75000 0.09166041122148283
836 75000 0.06371769236176253
396 75000 0.10903455977486064
462 75000 0.18866132416648182
818 75000 0.5478669940197456
818 225000 0.0831076690152435
429 75000 0.9607972377588191
1131 75000 -1
1239 75000 -1

So is the problem then having -1 as a bias value (those were converted from nan in the Sample1_150000_iced.matrix.biases file by HiCPro2FitHiC.py)? If that were the case, I would expect it to have stopped at the previous scaffold, which also has -1. So I looked into the fragmentMappability file and saw that the last one is the only one with zeros in its last two columns:

517 0 75000 17395 1
517 150000 225000 2437 1
1522 0 75000 503 1
836 0 75000 303 1
396 0 75000 685 1
462 0 75000 1154 1
818 0 75000 3455 1
818 150000 225000 475 1
429 0 75000 6951 1
1131 0 75000 3 1
1239 0 75000 0 0

I checked out https://github.com/ay-lab/fithic/issues/39 and it seems that I should actually remove any chromosome whose bias values are all below 0.5? In my case I have quite a lot of those, probably because I mapped the Hi-C reads to the whole reference genome and ran HiC-Pro on the small scaffolds too, rather than only on actual chromosomes.

Anyway, at least for this specific error, I tried different things and the error finally went away when I removed the chromosomes (in reality short scaffolds) with zero mappability from the fragmentMappability file and, at the same time, the corresponding lines from the biases file. This is not straightforward to do by simple pattern matching, since the biases file carries no information about which scaffolds have zero mappability, and in my case the scaffold names to remove also appear as values in column 4 of fithic.fragmentMappability. If anybody runs into the same issue I can happily share my solution. Or, since FitHiC apparently won't run in any similar case, maybe HiCPro2FitHiC.py could include a few lines to remove such cases from the two files directly? One last thing: it also worked when I kept a chromosome in which only the last window has zero mappability, so I left such last windows untouched.
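The removal described above can be done by scaffold name and column position rather than by pattern matching, which avoids accidental matches against numeric values in other columns. The following is a minimal sketch, not the script that was actually used: the function names are hypothetical, and it assumes that a fragment counts as valid when the fourth column of the fragments file is greater than zero (the rows quoted in this thread are consistent with that, but it has not been checked against the fithic source).

```python
from collections import defaultdict


def scaffolds_without_valid_fragments(frag_lines, count_col=3):
    """Return scaffolds for which every fragment has a zero in count_col (0-based).

    Assumption: a fragment is "valid" when its count_col value is > 0.
    """
    valid_counts = defaultdict(int)
    for line in frag_lines:
        fields = line.split()
        if not fields:
            continue
        # Adding 0 still registers the scaffold, so all-zero scaffolds are kept track of.
        valid_counts[fields[0]] += 1 if int(fields[count_col]) > 0 else 0
    return {name for name, n_valid in valid_counts.items() if n_valid == 0}


def drop_scaffolds(lines, bad):
    """Keep only lines whose first column (the scaffold name) is not in bad."""
    return [line for line in lines if line.split() and line.split()[0] not in bad]


# Rows copied from the fragmentMappability and biases excerpts in this thread.
frags = [
    "517 0 75000 17395 1",
    "517 150000 225000 2437 1",
    "1131 0 75000 3 1",
    "1239 0 75000 0 0",
]
biases = [
    "517 75000 1.970547623956338",
    "517 225000 0.40157523166875075",
    "1131 75000 -1",
    "1239 75000 -1",
]

bad = scaffolds_without_valid_fragments(frags)
clean_frags = drop_scaffolds(frags, bad)
clean_biases = drop_scaffolds(biases, bad)
print(sorted(bad))
```

With the real inputs, the lines could be read with `gzip.open(path, "rt")` and the filtered lines written back the same way in `"wt"` mode. Because matching is done on the first column only, a scaffold named 429 is not confused with a count of 429 elsewhere in the row.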

Thanks!

ay-lab commented 3 years ago

Regarding your question about #39 and whether to remove chromosomes whose bias values are all below 0.5: yes, and when you remove them, please remove the whole chromosome together with all fragments and interactions that correspond to it. Generally you can do this with a simple "grep -v", but in your case, now that you have converted the chr names to numbers, you may want to use awk '$1!=429 && $3!=429' on the interactions file, for instance, to remove chr/scaffold 429 and every entry related to it. You can do something similar for the fragments file.

I agree that this could be done during conversion by HiCPro2FitHiC.py. However, one may want to use a bias value threshold range other than the default of 0.5 to 2.0. In that case the filtering might remove chrs you still want, or keep ones you don't want under the new threshold. Thanks
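As a rough Python counterpart to the awk filtering suggested above, the sketch below first collects scaffolds that have no bias value inside a chosen range and then drops every interaction in which either end lies on such a scaffold. The function names are hypothetical, the 0.5 to 2.0 defaults come from this thread, treating the bounds as inclusive is an assumption, and the third interaction row is invented purely for illustration.

```python
def scaffolds_outside_bias_range(bias_lines, lower=0.5, upper=2.0):
    """Return scaffolds with no bias value inside [lower, upper] (inclusive bounds assumed)."""
    has_in_range = {}
    for line in bias_lines:
        name, _midpoint, bias = line.split()
        in_range = lower <= float(bias) <= upper
        has_in_range[name] = has_in_range.get(name, False) or in_range
    return {name for name, ok in has_in_range.items() if not ok}


def drop_from_interactions(interaction_lines, bad):
    """Drop every interaction where either end (columns 1 and 3) is a bad scaffold."""
    kept = []
    for line in interaction_lines:
        fields = line.split()
        if fields[0] not in bad and fields[2] not in bad:
            kept.append(line)
    return kept


# Bias rows copied from the excerpt in this thread.
biases = [
    "517 75000 1.970547623956338",
    "517 225000 0.40157523166875075",
    "1522 75000 0.09166041122148283",
    "818 75000 0.5478669940197456",
    "818 225000 0.0831076690152435",
    "1131 75000 -1",
]
interactions = [
    "517 75000 517 75000 1700",
    "517 75000 517 225000 5",
    "517 75000 1522 75000 2",  # hypothetical row, not from the thread
]

bad = scaffolds_outside_bias_range(biases)
kept = drop_from_interactions(interactions, bad)
print(sorted(bad))
print(kept)
```

Passing `lower` and `upper` explicitly keeps the filter in step with whatever bias thresholds are used for the Fit-Hi-C run, which addresses the concern about hard-coding the default range during conversion.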

FatihSarigol commented 3 years ago

Thank you for your suggestions one more time! I honestly forgot that I had removed the scaffold prefixes, so I took the hard way, but I guess it is still useful to keep for species whose chromosome names are only numbers. I identified the line numbers and then deleted them from both the biases and mappability files in a single command using awk 'NR!~/^(11|429|557|667|889|1033|1455|2222|2245|3122|3762|)$/'. And yes, whenever I removed a scaffold I removed all of its occurrences, since they were all just single-window scaffolds anyway.

As far as I understood from my attempts, it was not the bias threshold but a mappability value of zero for all occurrences of a scaffold that led to this error: the scaffolds with a bias of -1 that had some mappability value, for example, went through, as I kept them and the program ran successfully with them in there.

Thanks