aidenlab / straw

Extract data quickly from Juicebox via straw
MIT License

What is 'NaN' in the third column of retrieved txt files? #121

Open jiangshan529 opened 1 year ago

jiangshan529 commented 1 year ago

Hello, I have retrieved the information from a .hic file. However, I found 'nan' values in the third column. What does this mean? Are these missing values? Why do such values appear? Thanks!

60000 60000 3709.8407084423666
60000 65000 nan
60000 70000 nan
60000 85000 974.9826322352375
85000 85000 6662.110695460637

sa501428 commented 1 year ago

If a particular column was very sparse and KR/SCALE had to discard the column/row when normalizing the matrix, then all entries in the discarded row/column will be NaN.

moshe-olshansky commented 1 year ago

It means that these rows (bins) had to be removed during normalization. You should not see any NaN (Not a Number) in a raw (un-normalized) map.
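As an illustration, rows that belong to discarded bins can be separated from the rest when reading the retrieved text file. This is only a sketch: it assumes the three-column layout shown in the question (bin start, bin start, value), and the function name is hypothetical, not part of straw.

```python
import math


def split_nan_records(lines):
    """Split 'binX binY value' lines into finite and NaN records."""
    finite, missing = [], []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        record = (int(fields[0]), int(fields[1]), float(fields[2]))
        # float("nan") parses fine; math.isnan tells the two groups apart
        if math.isnan(record[2]):
            missing.append(record)
        else:
            finite.append(record)
    return finite, missing


# Sample rows copied from the question above
sample = [
    "60000 60000 3709.8407084423666",
    "60000 65000 nan",
    "60000 70000 nan",
    "60000 85000 974.9826322352375",
    "85000 85000 6662.110695460637",
]
finite, missing = split_nan_records(sample)
print(len(finite), "finite records;", len(missing), "NaN records")
```

Whether NaN entries should be dropped or kept as explicit missing values depends on the downstream analysis.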

jiangshan529 commented 1 year ago

Hi, thanks for your reply. I have run straw with normalization set to 'None', and this time I am not sure why there are floating-point numbers in the third column.

60000 60000 1
60000 65000 1
60000 70000 1
60000 85000 1
85000 85000 26
85000 90000 16
90000 90000 53
90000 95000 12
95000 95000 27
58570000 58605000 67.67821603676141
58575000 58605000 179.20274301772898
58580000 58605000 200.11589450569903
58585000 58605000 100.45844430432754
58590000 58605000 237.57597248105807
58595000 58605000 600.7486450401082
58605000 58605000 7401.773853029534
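To locate where such values start in a large output file, one option is to flag every row whose third column is not a whole number. This is a minimal sketch assuming the same three-column text layout; the helper name is hypothetical, and it presumes that raw counts in this file are expected to be integers.

```python
def non_integer_rows(lines):
    """Return (line_number, text) for rows whose value is not a whole number."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        value = float(fields[2])
        # float.is_integer() is False for fractional values, NaN and infinity
        if not value.is_integer():
            flagged.append((number, line.strip()))
    return flagged


# A few rows copied from the output above
rows = [
    "95000 95000 27",
    "58570000 58605000 67.67821603676141",
    "58605000 58605000 7401.773853029534",
]
for number, text in non_integer_rows(rows):
    print(number, text)
```

The line number of the first flagged row shows where the output stops looking like raw counts.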

moshe-olshansky commented 1 year ago

Are you using oe (observed over expected)? If so, float numbers should not surprise you.

jiangshan529 commented 1 year ago

Hi, the code I am using is:

result = straw.straw('NONE', "./4DNFI2TK7L2F.hic", "19", "19", "BP", 5000, 'chr19_5k.txt')

moshe-olshansky commented 1 year ago

Where does this hic file come from? Is it a (weighted) combination of several maps?

jiangshan529 commented 1 year ago

It is from the 4D Nucleome project and was generated with the Micro-C method. https://data.4dnucleome.org/files-processed/4DNFI2TK7L2F/#details

moshe-olshansky commented 1 year ago

I think that you should check exactly how this hic file was created. Read its documentation carefully, or ask the person in charge.

moshe-olshansky commented 1 year ago

By the way, have you tried using the dump command in juicer tools? Does it produce results identical to straw's?

jiangshan529 commented 1 year ago

Sorry, I just downloaded the .hic file. How should I use the dump command?

sa501428 commented 1 year ago

Please post this question to the forum. We reserve github issues for bugs.

sa501428 commented 1 year ago

https://groups.google.com/g/3d-genomics And that way, the community as a whole will also benefit from the answers. Thanks!

moshe-olshansky commented 1 year ago

Have you downloaded juicer_tools.jar? If so, run java -jar juicer_tools.jar and/or java -jar juicer_tools.jar dump to see the usage.

jiangshan529 commented 1 year ago

Hi, I have run it.

When I use 'observed', no floating-point numbers appear.

When I use 'oe', it looks like this:

60000 60000 3.0029617E-4
60000 65000 9.408559E-4
60000 70000 0.0040292614
60000 85000 0.012957309
85000 85000 0.0078077004
85000 90000 0.015053694
90000 90000 0.015915697
90000 95000 0.011290271
95000 95000 0.008107997

moshe-olshansky commented 1 year ago

Please do what Muhammad suggested and move the conversation to the forum. If straw and juicer_tools dump produce different results, it is probably a bug.

sa501428 commented 1 year ago

Apologies - if there is indeed a bug (different outputs from straw vs dump), then we should indeed discuss it here. Can you share the commands you used for juicer tools dump and straw, and the respective outputs? Can you also confirm what version of straw was used?

jiangshan529 commented 1 year ago

Hi, I think hicstraw.straw and juicer dump give the same result, but straw.straw gives a different result in the last lines.

The code I am using for straw is:

result = straw.straw('NONE',"./4DNFI2TK7L2F.hic", "19", "19", "BP", 5000, 'chr19_5k.txt')
f1 = open('chr19_5k.txt','w')
for i in range(len(result[0])):
    cmd1= "{0}\t{1}\t{2}\n".format(result[0][i], result[1][i], result[2][i])
    f1.write(cmd1)

And here is the other code, using hicstraw:

with open(OutFile, mode='w') as fp_out:
    result = hicstraw.straw(datatype, Norm, HiCFile, CHR1, CHR2, 'BP', resolution)
    for i in range(len(result)):
        print("{0}\t{1}\t{2}\t{3}\t{4}".format(chr1name, (result[i].binX + int(resolution / 2)), chr2name, (result[i].binY + int(resolution / 2)), result[i].counts), file=fp_out)

The command I used for juicer dump is:

java -Xmx48000m -Djava.awt.headless=true -jar juicer_tools_1.22.01.jar dump observed NONE 4DNFI2TK7L2F.hic 19 19 BP 5000 > dump.txt

Interestingly, the first lines of the results are the same with all three methods:

60000 60000 1.0

60000 65000 1.0

60000 70000 1.0

60000 85000 1.0

85000 85000 26.0

85000 90000 16.0

90000 90000 53.0

90000 95000 12.0

95000 95000 27.0

However, for the last lines, the results are different:

Result of straw.straw:

58570000 58605000 67.67821603676141

58575000 58605000 179.20274301772898

58580000 58605000 200.11589450569903

58585000 58605000 100.45844430432754

58590000 58605000 237.57597248105807

58595000 58605000 600.7486450401082

58605000 58605000 7401.773853029534

Result of hicstraw.straw:

58575000 58605000 14.0

58580000 58605000 19.0

58585000 58605000 4.0

58590000 58605000 2.0

58595000 58605000 2.0

58600000 58605000 1.0

58605000 58605000 52.0

Result of juicer dump(observed):

58575000 58605000 14.0

58580000 58605000 19.0

58585000 58605000 4.0

58590000 58605000 2.0

58595000 58605000 2.0

58600000 58605000 1.0

58605000 58605000 52.0

Result of juicer dump(oe):

60000 60000 3.0029617E-4

60000 65000 9.408559E-4

60000 70000 0.0040292614

60000 85000 0.012957309

85000 85000 0.0078077004

85000 90000 0.015053694

90000 90000 0.015915697
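A comparison like the one above can also be automated rather than done by eye. The sketch below assumes both outputs have been saved as three-column text (bin start, bin start, value); the function names are hypothetical and are not part of straw, hicstraw or juicer tools.

```python
import math


def load_records(lines):
    """Map (binX, binY) -> value for 'binX binY value' lines."""
    records = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 3:
            records[(int(fields[0]), int(fields[1]))] = float(fields[2])
    return records


def diff_records(a, b, tol=1e-6):
    """List bin pairs that are missing on one side or whose values differ."""
    report = []
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key), b.get(key)
        # NaN never compares as close, so NaN entries are always reported
        if va is None or vb is None or not math.isclose(va, vb, rel_tol=0.0, abs_tol=tol):
            report.append((key, va, vb))
    return report


# Last lines copied from the straw.straw and hicstraw.straw results above
straw_tail = load_records([
    "58570000 58605000 67.67821603676141",
    "58575000 58605000 179.20274301772898",
    "58580000 58605000 200.11589450569903",
    "58585000 58605000 100.45844430432754",
    "58590000 58605000 237.57597248105807",
    "58595000 58605000 600.7486450401082",
    "58605000 58605000 7401.773853029534",
])
hicstraw_tail = load_records([
    "58575000 58605000 14.0",
    "58580000 58605000 19.0",
    "58585000 58605000 4.0",
    "58590000 58605000 2.0",
    "58595000 58605000 2.0",
    "58600000 58605000 1.0",
    "58605000 58605000 52.0",
])
for key, va, vb in diff_records(straw_tail, hicstraw_tail):
    print(key, va, vb)
```

Run over the full files, the first reported bin pair would show where the two outputs begin to diverge.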

sa501428 commented 1 year ago

Is straw.straw the C++ version and hicstraw.straw the Python version? If not, which versions are you using?