cga-harvard / Data_Science_Big_Data_Projects

Repository for FASRC projects
MIT License

Error loading tables into omni sci in knn_model.py #13

Open jakerbrown opened 4 years ago

jakerbrown commented 4 years ago

While running the modified knn_model.py script I got the following error. It appears to be related to converting the merged table to omnisci, but I do not know what the error message "Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array" means or how to fix it:

(omnisci) [jbrown613@holygpu2c0705 neighbors]$ time python3 ~/sql/knn_model_merge.py
Connecting to Omnisci
Connected Connection(omnisci://admin:***@localhost:9893/omnisci?protocol=binary)
Traceback (most recent call last):
  File "/n/home09/jbrown613/sql/knn_model_merge.py", line 37, in <module>
    conn.load_table("m",m,create='infer',method='arrow')
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 687, in load_table
    return self.load_table_arrow(table_name, data)
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 835, in load_table_arrow
    data, metadata, preserve_index=preserve_index
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/_pandas_loaders.py", line 248, in serialize_arrow_payload
    data = pa.RecordBatch.from_pandas(data, preserve_index=preserve_index)
  File "pyarrow/table.pxi", line 704, in pyarrow.lib.RecordBatch.from_pandas
  File "pyarrow/table.pxi", line 749, in pyarrow.lib.RecordBatch.from_arrays
TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array

real    5m54.114s
user    5m11.406s
sys     0m25.939s

dkakkar commented 4 years ago

Are your table name and DataFrame name both "m"?

jakerbrown commented 4 years ago

Yes. Is that an issue?

dkakkar commented 4 years ago

Please share your script with me by email.

dkakkar commented 4 years ago

conn.load_table("mrg",m,create='infer',method='arrow'). Is this the line causing the error? Could you print "m" with the following and share the output:

print(m.head())

jakerbrown commented 4 years ago

Hi Devika,

Yes, it appears that is the line that is causing error. Here is the printed output:

m.head(5)
   dpost  rpost neighbor_id
0    0.0    0.0   AK-630667
1    0.0    0.0   AK-701587
2    0.0    0.0   AK-656813
3    0.0    0.0   AK-656812
4    0.0    0.0   AK-701520

dkakkar commented 4 years ago

Try this instead of "conn.load_table("voters",df,create='infer',method='arrow')":

conn.execute("Create table IF NOT EXISTS mrg (dpost FLOAT, rpost FLOAT, neighbor_id TEXT ENCODING NONE);")
conn.load_table_columnar("mrg", m,preserve_index=False)
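
One possible reading of the original TypeError, stated as an assumption rather than a confirmed diagnosis: Arrow's default string type uses 32-bit offsets, so a single column holding more than about 2 GiB of text may come back from pyarrow as a ChunkedArray, which `RecordBatch.from_pandas` cannot accept. `load_table_columnar` does not go through that Arrow conversion, which may be why it avoids the error. A minimal sketch for checking whether a column is anywhere near that size; the helper names are hypothetical:

```python
# Assumption: one contiguous Arrow string array holds at most
# 2**31 - 1 bytes of UTF-8 data because its offsets are 32-bit.
ARROW_STRING_LIMIT = 2**31 - 1


def utf8_bytes(values):
    """Total UTF-8 size in bytes of an iterable of strings (None counts as 0)."""
    return sum(len(v.encode("utf-8")) for v in values if v is not None)


def may_overflow_arrow_string(values, limit=ARROW_STRING_LIMIT):
    """True if the column's text would not fit in one Arrow string array."""
    return utf8_bytes(values) > limit


# Hypothetical sample shaped like the neighbor_id column in this thread.
sample = ["AK-630667", "AK-701587", "AK-656813"]
print(utf8_bytes(sample))                 # 27
print(may_overflow_arrow_string(sample))  # False
```

With a DataFrame, the same check could be called as `may_overflow_arrow_string(m["neighbor_id"])`.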

jakerbrown commented 4 years ago

This seems to work. The current problem arises from reading in the knn output file (in this case it is knn_1000_CA1_2012.tar.gz). It seems to be timing out or hitting a memory limit?

df = pd.read_csv(filename, sep=',',dtype='unicode',index_col=None, low_memory='true',compression='gzip')

Killed

dkakkar commented 4 years ago

How much GPU memory are you using? Please send the parameters of your job.

dkakkar commented 4 years ago

Also, please test the entire script with a smaller file so that we know whether the problem is in the script or in memory.

jakerbrown commented 4 years ago

In testing this on the Rhode Island file, I am able to load the knn_1000_RI1_2012.tar.gz file, but it does not look like the data frame we expect:

df.head(5)
                                knn_1000_RI_2012.csv
0  PA-000007920358\tPA-10407918\td\tr\t40.2433976...
1  PA-000007920358\tPA-10408513\td\tr\t40.2433976...
2  PA-000007920358\tPA-000006487459\td\td\t40.243...
3  PA-000007920358\tPA-000006909098\td\td\t40.243...
4  PA-000007920358\tPA-000000307624\td\tr\t40.243...

I think this means the sep is “\t” not “,”?

dkakkar commented 4 years ago

Yes, the separator is '\t', but your script specifies ',', no?
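
A side note, offered as an assumption since the script itself is not shown here: the column header `knn_1000_RI_2012.csv` in the output above looks like the file name stored inside the archive, which would suggest the .tar.gz was decompressed as plain gzip and the tar header was parsed as data. A minimal sketch that opens the archive with `tarfile` and reads tab-separated rows; the function name is hypothetical:

```python
import csv
import io
import tarfile


def iter_tsv_rows_from_targz(path, delimiter="\t"):
    """Yield rows from the first regular file inside a .tar.gz archive."""
    with tarfile.open(path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            raw = archive.extractfile(member)
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield from csv.reader(text, delimiter=delimiter)
            return  # this sketch reads only the first file
```

Because rows are yielded one at a time, the whole file never has to sit in memory at once. The object returned by `extractfile` is an ordinary binary file object, so it could presumably also be handed to `pd.read_csv(..., sep="\t")`.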

jakerbrown commented 4 years ago

So I can upload the file, but when I try to load it to Omni Sci the process dies:

conn.execute("Create table IF NOT EXISTS knn (source_id TEXT ENCODING NONE, neighbor_id TEXT ENCODING NONE, source_pid TEXT ENCODING NONE, neighbor_pid TEXT ENCODING NONE, dist FLOAT);")
<pymapd.cursor.Cursor object at 0x2b96a3dd4048>
conn.load_table_columnar("knn", df,preserve_index=False)
Killed

dkakkar commented 4 years ago

This seems like a memory issue. Please send me the parameters you used to launch the job.

dkakkar commented 4 years ago

Please use a machine with 256 GB RAM, 2 CPUs, and 1 GPU.

jakerbrown commented 4 years ago

That is what I'm using, I believe.

dkakkar commented 4 years ago

Please recheck the parameters. If it still fails with 256 GB, first check with FASRC help by email whether memory is the reason for the failure. If it is, you will have to divide the file into smaller chunks to model it, because FASRC does not allow more than 256 GB on GPU. When dividing it, make sure each chunk includes all neighbors of every voter it contains. For example, if you take voter IDs 1 to 100, the file should have all 1000 neighbors for each of those voters; otherwise the modelling will be corrupt.
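
The rule of keeping every neighbor of a voter in the same chunk can be sketched as below. This is an illustration under stated assumptions, not the project's actual splitting code: it assumes the rows for one voter are adjacent in the file and that the first field is the voter (source) ID. The function name and sample rows are hypothetical.

```python
from itertools import groupby


def chunk_by_voter(rows, voters_per_chunk, key=lambda row: row[0]):
    """Yield lists of rows, each holding at most `voters_per_chunk` whole voters.

    A voter's rows are never split across two chunks, provided all rows
    for that voter are adjacent in `rows`.
    """
    chunk, voters = [], 0
    for _, group in groupby(rows, key=key):
        if voters == voters_per_chunk:
            yield chunk
            chunk, voters = [], 0
        chunk.extend(group)
        voters += 1
    if chunk:
        yield chunk


# Hypothetical (source_id, neighbor_id) pairs.
rows = [("v1", "n1"), ("v1", "n2"), ("v2", "n3"), ("v3", "n4"), ("v3", "n5")]
for part in chunk_by_voter(rows, voters_per_chunk=2):
    print(part)
```

With `voters_per_chunk=2`, the first chunk holds all rows for v1 and v2, and the second holds all rows for v3.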

dkakkar commented 4 years ago

Also, I would suggest testing with the smallest input file (smaller than RI) you have in hand so that we are sure that the script is correct before we solve the memory scaling issue.

jakerbrown commented 4 years ago

Okay. How do I go about dividing it? By creating smaller groups from the outset when generating knn output?

dkakkar commented 4 years ago

Yes, that is what you would ultimately have to do for bigger files. Please divide it into smaller groups and try again, but before that check with FASRC whether memory is indeed the issue even with 256 GB.

jakerbrown commented 4 years ago

Okay I am in the process of re-running it. I should also note that the FASRC overall session does not die, just the python3 session activated by the knn_model.py script.

dkakkar commented 4 years ago

If the overall session does not die, then it might not be a GPU memory issue. Please run the Python script in screen.
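
A bare "Killed" while the session stays alive is often what is left behind when the operating system or the scheduler's memory limit stops a process for using too much host RAM. That is an assumption here, not something confirmed in this thread. One way to gather evidence is to log the script's peak memory at a few points. A minimal sketch using the standard-library `resource` module (Unix only); the function name is hypothetical:

```python
import resource
import sys


def peak_rss_gib():
    """Peak resident set size of this process in GiB.

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    bytes_used = peak if sys.platform == "darwin" else peak * 1024
    return bytes_used / 2**30


print(f"peak memory so far: {peak_rss_gib():.2f} GiB")
```

Calling this right after the file is read and again just before the load would show how close the script gets to the job's memory request.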

dkakkar commented 4 years ago

Yes I am running it in screen.

On Sep 15, 2020, at 11:05 AM, Devika Kakkar devikakakkar29@gmail.com wrote:

If the overall session does not die, then it might not be a GPU memory issue. Please run the Python script in screen.

On Tue, Sep 15, 2020, 11:03 AM Jacob Brown notifications@github.com wrote:

Okay, I am in the process of re-running it. I should also note that the overall FASRC session does not die, just the python3 session started by the knn_model.py script.

On Sep 15, 2020, at 10:29 AM, dkakkar notifications@github.com wrote:

Yes, that is what you would ultimately have to do for bigger files. Please divide it into smaller groups and try again, but before that check with FASRC whether memory is indeed the issue even with 256 GB.

On Tue, Sep 15, 2020 at 10:27 AM Jacob Brown notifications@github.com wrote:

Okay. How do I go about dividing it? By creating smaller groups from the outset when generating the knn output?

On Sep 15, 2020, at 10:25 AM, dkakkar notifications@github.com wrote:

Also, I would suggest testing with the smallest input file (smaller than RI) you have in hand, so that we are sure the script is correct before we solve the memory scaling issue.

On Tue, Sep 15, 2020 at 10:22 AM Devika Kakkar devikakakkar29@gmail.com wrote:

Please recheck the parameters, and if it still fails with 256 GB, first check with the FASRC help email whether memory is the reason for its failure. If memory is the reason, then you will have to divide the file into smaller chunks to model it, because FASRC does not allow more than 256 GB on GPU. While dividing into smaller chunks, make sure you include all neighbors of a voter in the same file. For example, if you take voter IDs 1 to 100, then the file should contain all 1000 neighbors of each of voter IDs 1-100; otherwise the modelling will be corrupt.

On Tue, Sep 15, 2020 at 10:19 AM Jacob Brown notifications@github.com wrote:

That is what I'm using, I believe.

On Sep 15, 2020, at 10:17 AM, dkakkar notifications@github.com wrote:

Please use a machine with 256 GB RAM, 2 CPUs and 1 GPU.

On Tue, Sep 15, 2020 at 7:56 AM Devika Kakkar devikakakkar29@gmail.com wrote:

This seems like a memory issue. Please send me the parameters you used to launch the job.

On Tue, Sep 15, 2020, 12:50 AM Jacob Brown notifications@github.com wrote:

So I can upload the file, but when I try to load it into OmniSci the process dies:

conn.execute("Create table IF NOT EXISTS knn (source_id TEXT ENCODING NONE, neighbor_id TEXT ENCODING NONE, source_pid TEXT ENCODING NONE, neighbor_pid TEXT ENCODING NONE, dist FLOAT);")
<pymapd.cursor.Cursor object at 0x2b96a3dd4048>
conn.load_table_columnar("knn", df,preserve_index=False)
Killed

On Sep 14, 2020, at 11:59 PM, dkakkar notifications@github.com wrote:

Yes, separator is '\t' but in your script you mentioned ',', no?

dkakkar commented 4 years ago

Then your DataFrame is running out of memory because the file is too big to read all at once. Please read it in chunks: look into the chunksize option of pandas read_csv and modify the script accordingly:

pd.read_csv(filename, chunksize=chunksize)
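
A minimal sketch of what chunked reading could look like, assuming a tab-separated file without a header row as in the examples in this thread. The function name `load_in_chunks` and the `load_chunk` callback are hypothetical; the callback stands in for whatever does the actual load, for instance a small wrapper around the `conn.load_table_columnar("knn", chunk, preserve_index=False)` call shown earlier.

```python
import pandas as pd


def load_in_chunks(path, load_chunk, chunksize=1_000_000, sep="\t"):
    """Read a delimited file piece by piece and hand each piece to load_chunk.

    load_chunk is a hypothetical callback supplied by the caller; it receives
    one DataFrame per chunk. Returns the total number of rows passed on.
    """
    total_rows = 0
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of one large DataFrame.
    reader = pd.read_csv(path, sep=sep, header=None, dtype=str, chunksize=chunksize)
    for chunk in reader:
        load_chunk(chunk)
        total_rows += len(chunk)
    return total_rows
```

Only one chunk is held in memory at a time, so peak memory depends on `chunksize` rather than on the size of the file. A suitable value has to be found by experiment.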

jakerbrown commented 4 years ago

Okay, thanks Devika. This might solve one issue, but recall that last night the process also died while loading one of the smaller tables (RI) into OmniSci, that is, after the table had been successfully read into the Python environment.

dkakkar commented 4 years ago

I think it is a memory issue. Please divide the file into smaller pieces and try again, and let's see what happens.
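
One way the split could be done while following the earlier advice in this thread to keep all of a voter's neighbors in the same file is sketched below. It assumes that the source voter ID is the first tab-separated column and that all rows for one voter are contiguous in the file, which the sample output in this thread suggests but does not guarantee. The function name `split_by_voter` and the `part_NNNN.tsv` naming are hypothetical.

```python
import itertools
import os


def split_by_voter(path, out_dir, max_rows=1_000_000, sep="\t"):
    """Split a kNN file into parts without separating a voter's neighbors.

    Assumes rows for the same source voter (first column) are contiguous.
    Returns the list of part-file paths written.
    """
    os.makedirs(out_dir, exist_ok=True)
    parts = []
    out = None
    rows_in_part = 0
    with open(path) as src:
        # Group consecutive lines that share the same first column.
        groups = itertools.groupby(src, key=lambda line: line.split(sep, 1)[0])
        for _voter, lines in groups:
            # A new part is only ever started at a voter boundary.
            if out is None or rows_in_part >= max_rows:
                if out is not None:
                    out.close()
                part_path = os.path.join(out_dir, "part_%04d.tsv" % len(parts))
                parts.append(part_path)
                out = open(part_path, "w")
                rows_in_part = 0
            for line in lines:
                out.write(line)
                rows_in_part += 1
    if out is not None:
        out.close()
    return parts
```

Because a part is closed only between voters, a part can exceed `max_rows` by up to one voter's worth of rows; `max_rows` is a soft limit, not a hard cap.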

jakerbrown commented 4 years ago

Hi Devika,

After looking at this more, one of the issues might have to do with how the file is being read into Python. When I read the tarred file directly into Python, there is a weird value in the first row of the first column. This does not occur if I first unzip the file and then load the .csv into Python. Why might this be happening? See below:

df = pd.read_csv('knn_1000_AK1_2012.tar.gz', sep='\t',dtype='unicode',index_col=None, low_memory='true',compression='gzip', header=None)
df.head()
                                                   0          1  2  3  4   5   6
0  n/holyscratch01/enos_lab/jbrown613/data/knn_10...  AK-709502  i  d  0  \N  \N
1                                          AK-787334  AK-706032  i  r  0  \N  \N
2                                          AK-787334  AK-647339  i  r  0  \N  \N
3                                          AK-787334  AK-618324  i  i  0  \N  \N
4                                          AK-787334  DC-567085  i  i  0  \N  \N

Compared to this when reading in the unzipped file:

df = pd.read_csv('knn_1000_AK1_2012.csv', sep='\t',dtype='unicode',index_col=None, low_memory='true',header=None)
df.head()
           0          1  2  3  4   5   6
0  AK-787334  AK-709502  i  d  0  \N  \N
1  AK-787334  AK-706032  i  r  0  \N  \N
2  AK-787334  AK-647339  i  r  0  \N  \N
3  AK-787334  AK-618324  i  i  0  \N  \N
4  AK-787334  DC-567085  i  i  0  \N  \N

dkakkar commented 4 years ago

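You are reading a .tar.gz compressed file, but in your read_csv call you specify gzip compression. This is causing the problem: gzip decompression alone leaves the tar archive wrapper in place, so the archive header ends up in the first cell. Could you look into how to read a .tar.gz archive into a DataFrame?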
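
A sketch of one way to do this with the standard-library `tarfile` module, assuming the archive holds a single tab-separated file without a header row as in the examples above. The function name `read_tsv_from_targz` is hypothetical.

```python
import tarfile

import pandas as pd


def read_tsv_from_targz(archive_path, member_name=None):
    """Read one tab-separated member of a .tar.gz archive into a DataFrame.

    If member_name is None, the first regular file in the archive is used.
    """
    with tarfile.open(archive_path, mode="r:gz") as tar:
        if member_name is None:
            # Iterate lazily and stop at the first regular file.
            member = next(m for m in tar if m.isfile())
        else:
            member = tar.getmember(member_name)
        # extractfile gives a file-like object positioned at the member's
        # data, so the tar header never reaches pandas.
        fileobj = tar.extractfile(member)
        return pd.read_csv(fileobj, sep="\t", dtype=str, header=None)
```

The file is fully parsed before the archive is closed, so the returned DataFrame does not depend on the archive staying open.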

jakerbrown commented 4 years ago

Thanks, I'll look into this. Would another potential solution be to compress the file so that it only has the .gz extension?

dkakkar commented 4 years ago

Yes.

jakerbrown commented 4 years ago

This does not actually appear to have solved the issue, as the filename still shows up in the first row of the first column:

df = pd.read_csv('knn_1000_AK1_2012.gz', sep='\t',dtype='unicode',index_col=None, low_memory='true',compression='gzip', header=None)
df.head()
                       0          1  2  3  4   5   6
0  knn_1000_AK1_2012.csv  AK-709502  i  d  0  \N  \N
1              AK-787334  AK-706032  i  r  0  \N  \N
2              AK-787334  AK-647339  i  r  0  \N  \N
3              AK-787334  AK-618324  i  i  0  \N  \N
4              AK-787334  DC-567085  i  i  0  \N  \N
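
One possible explanation, which the thread does not confirm, is that the .gz file was still produced by tar and only carries a different extension. The sketch below is a way to check that assumption and to produce a plain gzip file instead. The helper names `is_tar_archive` and `gzip_file` are hypothetical; `tarfile.is_tarfile`, `gzip.open` and `shutil.copyfileobj` are standard-library calls.

```python
import gzip
import shutil
import tarfile


def is_tar_archive(path):
    """Return True if path is a tar archive (compressed or not), whatever its extension."""
    return tarfile.is_tarfile(path)


def gzip_file(src_path, dst_path):
    """Compress a single file with plain gzip, with no tar wrapper around it."""
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

If `is_tar_archive` returns True for the .gz file, reading it with `compression='gzip'` would again expose the tar header to pandas, which would be consistent with the output above.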

jakerbrown commented 4 years ago

Hi Devika,

You can disregard my last email. I am still troubleshooting some things, and I'll give a full report in a few hours.

Thanks,

Jake

On Sep 15, 2020, at 1:11 PM, dkakkar notifications@github.com wrote:

Yes.

On Tue, Sep 15, 2020 at 1:09 PM Jacob Brown notifications@github.com wrote:

Thanks ill look into this. Is one potential solution also zipping the file such that it only has the extension .gz?

On Sep 15, 2020, at 1:02 PM, dkakkar notifications@github.com wrote:

You are ready .tar.gz compressed file but in your dataframe read CSV you are mentioning .gz compressed. This is causing the problem. Could you look into how to read .tar.gz compression to dataframe.

On Tue, Sep 15, 2020 at 12:57 PM Jacob Brown notifications@github.com wrote:

Hi Devika,

After looking at this more one of the issues might have to do with how it is being read into Python. When I read in the tarred file directly into python, there is a weird value in the first row and first column intersection. This does not occur if I first unzip the file and then load the .csv into Python. Why might this be happening? See below:

df = pd.read_csv('knn_1000_AK1_2012.tar.gz', sep='\t',dtype='unicode',index_col=None, low_memory='true',compression='gzip', header=None) df.head() df.head() 0 1 2 3 4 5 6 0 n/holyscratch01/enos_lab/jbrown613/data/knn_10... AK-709502 i d 0 \N \N 1 AK-787334 AK-706032 i r 0 \N \N 2 AK-787334 AK-647339 i r 0 \N \N 3 AK-787334 AK-618324 i i 0 \N \N 4 AK-787334 DC-567085 i i 0 \N \N

Compared to this when reading in the unzipped file:

df = pd.read_csv('knn_1000_AK1_2012.csv', sep='\t',dtype='unicode',index_col=None, low_memory='true',header=None) df.head() 0 1 2 3 4 5 6 0 AK-787334 AK-709502 i d 0 \N \N 1 AK-787334 AK-706032 i r 0 \N \N 2 AK-787334 AK-647339 i r 0 \N \N 3 AK-787334 AK-618324 i i 0 \N \N 4 AK-787334 DC-567085 i i 0 \N \N

On Sep 15, 2020, at 11:14 AM, dkakkar notifications@github.com wrote:

I think it is a memory issue. Please divide the file into smaller pieces and try again, and let's see what happens.

On Tue, Sep 15, 2020 at 11:11 AM Jacob Brown notifications@github.com wrote:

Okay, thanks Devika. This might solve one issue, but also recall that last night the process died while loading one of the smaller tables (RI) into OmniSci, that is, after it had been successfully loaded into the Python environment.

On Sep 15, 2020, at 11:09 AM, dkakkar notifications@github.com wrote:

Then your dataframe is running out of memory because the file is too big to read all at once. Please read it in chunks: look into the chunksize option of pandas read_csv and modify the script accordingly:

pd.read_csv(filename, chunksize=chunksize)
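As a rough illustration of the chunked approach suggested above: with chunksize set, pd.read_csv returns an iterator of DataFrames rather than one large DataFrame, so only one chunk needs to be in memory at a time. The helper name iter_csv_chunks and the default chunk size are assumptions made for this sketch, not values from the thread.

```python
import pandas as pd


def iter_csv_chunks(path, chunksize=1_000_000, **read_csv_kwargs):
    """Yield DataFrames of at most `chunksize` rows from a CSV file.

    Hypothetical helper; the default chunk size is an arbitrary choice
    and should be tuned to the memory available on the node.
    """
    reader = pd.read_csv(path, chunksize=chunksize, **read_csv_kwargs)
    for chunk in reader:
        yield chunk


# Possible usage (assumes an open pymapd connection `conn` and an
# existing table named "knn"; both are placeholders):
#
# for chunk in iter_csv_chunks("knn_1000_AK1_2012.csv", sep="\t",
#                              dtype=str, header=None):
#     conn.load_table("knn", chunk)
```

Each chunk is loaded and then released before the next one is read, which trades some speed for a much smaller peak memory footprint.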


dkakkar commented 4 years ago

Please share the file with me.

On Tue, Sep 15, 2020, 1:33 PM Jacob Brown notifications@github.com wrote:

This does not appear to have solved the issue, as the filename still shows up in the first row and column:

df = pd.read_csv('knn_1000_AK1_2012.gz', sep='\t',dtype='unicode',index_col=None, low_memory='true',compression='gzip', header=None)
df.head()
                       0          1  2  3  4   5   6
0  knn_1000_AK1_2012.csv  AK-709502  i  d  0  \N  \N
1              AK-787334  AK-706032  i  r  0  \N  \N
2              AK-787334  AK-647339  i  r  0  \N  \N
3              AK-787334  AK-618324  i  i  0  \N  \N
4              AK-787334  DC-567085  i  i  0  \N  \N
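One possible reading of this output, which the thread does not confirm, is that the .gz file still wraps a tar archive, since a member file name appears where the first value should be. A small sketch for checking what a compressed file actually contains, regardless of its extension; the helper name tar_member_names is hypothetical.

```python
import tarfile


def tar_member_names(path):
    """Return the member names if `path` is a (possibly compressed) tar
    archive, or an empty list if it is not a tar archive at all.

    Hypothetical helper for diagnosing files whose extension may not
    match their real format.
    """
    if not tarfile.is_tarfile(path):
        return []
    # "r:*" lets tarfile detect gzip, bz2, or xz compression by itself.
    with tarfile.open(path, mode="r:*") as archive:
        return archive.getnames()
```

An empty list suggests a plain gzip file that pandas can read with compression='gzip'; a non-empty list suggests the file needs to go through tarfile first.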


dkakkar commented 4 years ago

Sure, take your time.


jakerbrown commented 4 years ago

Hi Devika,

I have figured out how to handle reading in the zipped files, and I have been able to read some of the smaller files into both Python and OmniSci. The issues I am running into now involve running the modeling code you provided, as I am getting errors related to grouping on string columns. You can see that output below:

conn.execute("Create table results as (SELECT source_id, AVG(dpost) as mean_d_post, AVG(rpost) as mean_r_post, SUM(dpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_d_post, SUM(rpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_r_post FROM knn GROUP BY source_id);")
Traceback (most recent call last):
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
    at_most_n=-1,
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
    return self.recv_sql_execute()
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
    raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Cannot group by string columns which are not dictionary encoded.')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "", line 1, in
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
    return c.execute(operation, parameters=parameters)
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
    raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Cannot group by string columns which are not dictionary encoded.
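The error above indicates that source_id was stored as a string column without dictionary encoding, which can happen when the table schema is inferred at load time. One possible workaround, sketched here under assumptions, is to create the table first with an explicit TEXT ENCODING DICT column and then load into it with create=False. The column list and the DOUBLE types for dist, dpost, and rpost are guesses based on the query in this comment, and the helper names are hypothetical.

```python
def create_table_ddl(table, columns):
    """Build a CREATE TABLE statement from a {column: SQL type} mapping."""
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    return f"CREATE TABLE {table} ({cols});"


# Assumed schema: only the column names come from the thread's queries;
# the numeric types are guesses and should be checked against the data.
KNN_COLUMNS = {
    "source_id": "TEXT ENCODING DICT(32)",
    "neighbor_id": "TEXT ENCODING DICT(32)",
    "dist": "DOUBLE",
    "dpost": "DOUBLE",
    "rpost": "DOUBLE",
}


def load_dict_encoded(conn, table, df, columns):
    """Create `table` with an explicit schema, then append `df` to it.

    `conn` is expected to be a pymapd connection; create=False tells
    load_table to use the existing table instead of inferring one.
    """
    conn.execute(create_table_ddl(table, columns))
    conn.load_table(table, df, create=False)
```

With dictionary-encoded id columns, the GROUP BY source_id query should no longer hit this particular error, although that has not been verified against this dataset.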

I also got an error that I could not join tables using TEXT type variables in OmniSci. This occurred when I was trying to merge in the new rpost and dpost values:

conn.execute("Create table temp as (SELECT * FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);")
Traceback (most recent call last):
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
    at_most_n=-1,
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
    return self.recv_sql_execute()
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
    raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Projection type TEXT not supported for outer joins yet')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "", line 1, in
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
    return c.execute(operation, parameters=parameters)
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
    raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Projection type TEXT not supported for outer joins yet
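If the outer join cannot be run inside OmniSci, one alternative is to do the left join in pandas before loading the result. This swaps in a different technique from the SQL above and only works when both tables fit in memory; the helper name left_join_in_pandas is hypothetical, and the key name neighbor_id is taken from the query.

```python
import pandas as pd


def left_join_in_pandas(knn, mrg, key="neighbor_id"):
    """Left-join `mrg` onto `knn` on `key`, keeping every row of `knn`.

    Rows of `knn` with no match in `mrg` get NaN in the merged columns,
    mirroring what a SQL LEFT JOIN would return as NULL.
    """
    return knn.merge(mrg, how="left", on=key)
```

The merged frame could then be loaded as a single table, avoiding the outer join on the database side.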


dkakkar commented 4 years ago

What is the data type for source_id?


dkakkar commented 4 years ago

The data type for source_id is STR

On Sep 15, 2020, at 3:53 PM, Devika Kakkar devikakakkar29@gmail.com wrote:

What is the data type for source_id?

On Tue, Sep 15, 2020 at 3:52 PM Jacob Brown <notifications@github.com mailto:notifications@github.com> wrote:

Hi Devika,

So I have figured out how to handle reading in the zipped files, and I have been able to read in some of the smaller files to both Python and OmniSci. The issues I am running into now involve running the modeling code you provided, as am getting errors related to grouping on string columns. You can see that output below:

conn.execute("Create table results as (SELECT source_id, AVG(dpost) as mean_d_post, AVG(rpost) as mean_r_post, SUM(dpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_d_post, SUM(rpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_r_post FROM knn GROUP BY source_id);")
Traceback (most recent call last):
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
    at_most_n=-1,
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
    return self.recv_sql_execute()
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
    raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Cannot group by string columns which are not dictionary encoded.')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "", line 1, in
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
    return c.execute(operation, parameters=parameters)
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
    raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Cannot group by string columns which are not dictionary encoded.

I also got an error saying that I could not join tables using TEXT type variables in OmniSci. This occurred when I was trying to merge in the new rpost and dpost values:

conn.execute("Create table temp as (SELECT * FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);")
Traceback (most recent call last):
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
    at_most_n=-1,
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
    return self.recv_sql_execute()
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
    raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Projection type TEXT not supported for outer joins yet')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "", line 1, in
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
    return c.execute(operation, parameters=parameters)
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
    raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Projection type TEXT not supported for outer joins yet
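For reference, the aggregation that the `results` query is meant to compute can be sketched in pandas. This assumes the products inside `SUM` are `dpost * 1/(1+dist)` and `rpost * 1/(1+dist)`, that is, inverse-distance weights. The function name `summarize_neighbors` and the toy data are illustrative only and are not part of the original script.

```python
import pandas as pd


def summarize_neighbors(knn: pd.DataFrame) -> pd.DataFrame:
    """Per source_id: plain and inverse-distance-weighted means of dpost and rpost."""
    weight = 1.0 / (1.0 + knn["dist"])
    # Precompute the weighted terms so a plain groupby-sum gives the numerators.
    tmp = knn.assign(w=weight, wd=knn["dpost"] * weight, wr=knn["rpost"] * weight)
    grouped = tmp.groupby("source_id")
    out = pd.DataFrame({
        "mean_d_post": grouped["dpost"].mean(),
        "mean_r_post": grouped["rpost"].mean(),
        "wtd_d_post": grouped["wd"].sum() / grouped["w"].sum(),
        "wtd_r_post": grouped["wr"].sum() / grouped["w"].sum(),
    })
    return out.reset_index()


# Toy data (made up): source A has two neighbors, source B has one.
toy = pd.DataFrame({
    "source_id": ["A", "A", "B"],
    "neighbor_id": ["x", "y", "x"],
    "dist": [0.0, 1.0, 3.0],
    "dpost": [1.0, 0.0, 0.5],
    "rpost": [0.0, 1.0, 0.5],
})
print(summarize_neighbors(toy))
```

Running the same numbers through both pandas and OmniSci on a small file may be a quick way to check that the SQL does what is intended.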

On Sep 15, 2020, at 2:44 PM, dkakkar <notifications@github.com> wrote:

Sure, take your time.

On Tue, Sep 15, 2020 at 1:51 PM Jacob Brown <notifications@github.com> wrote:

Hi Devika,

You can disregard my last email. I am still troubleshooting some things and will give a full report in a few hours.

Thanks,

Jake

On Sep 15, 2020, at 1:11 PM, dkakkar <notifications@github.com> wrote:

Yes.

On Tue, Sep 15, 2020 at 1:09 PM Jacob Brown <notifications@github.com> wrote:

Thanks, I'll look into this. Would another potential solution be to compress the file so that it has only the .gz extension?

On Sep 15, 2020, at 1:02 PM, dkakkar <notifications@github.com> wrote:

You are reading a .tar.gz compressed file, but in your dataframe read_csv call you are specifying .gz compression. This is causing the problem. Could you look into how to read a .tar.gz file into a dataframe?
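A `.tar.gz` file is a tar archive that has been gzip-compressed, so `compression='gzip'` only undoes the outer layer and pandas then parses the raw tar stream, whose header begins with the member's path. That would be consistent with the path-like value reported in the first cell. A minimal sketch of reading the CSV member through the standard-library `tarfile` module follows; the helper name `read_csv_from_targz`, the toy file, and the choice to read the first regular file in the archive are assumptions.

```python
import os
import tarfile
import tempfile

import pandas as pd


def read_csv_from_targz(path, **read_csv_kwargs):
    """Read the first regular file inside a .tar.gz archive into a DataFrame."""
    with tarfile.open(path, mode="r:gz") as archive:
        for member in archive:
            if member.isfile():
                # extractfile returns a binary file object that pandas can read.
                with archive.extractfile(member) as handle:
                    return pd.read_csv(handle, **read_csv_kwargs)
    raise ValueError(f"no regular file found in {path}")


# Toy round trip: write a small tab-separated file, tar+gzip it, read it back.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "toy.csv")
    with open(csv_path, "w") as f:
        f.write("AK-1\tAK-2\t0\nAK-1\tAK-3\t0\n")
    tgz_path = os.path.join(tmp, "toy.tar.gz")
    with tarfile.open(tgz_path, "w:gz") as archive:
        archive.add(csv_path, arcname="toy.csv")
    df = read_csv_from_targz(tgz_path, sep="\t", dtype=str, header=None)
print(df)
```

A file compressed with plain gzip (for example `knn.csv.gz`, with no tar layer) can instead be passed straight to `pd.read_csv` with `compression='gzip'`.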

On Tue, Sep 15, 2020 at 12:57 PM Jacob Brown <notifications@github.com> wrote:

Hi Devika,

After looking at this more, one of the issues might have to do with how the file is being read into Python. When I read the tarred file directly into Python, there is a weird value in the first row of the first column. This does not occur if I first unzip the file and then load the .csv into Python. Why might this be happening? See below:

df = pd.read_csv('knn_1000_AK1_2012.tar.gz', sep='\t',dtype='unicode',index_col=None, low_memory='true',compression='gzip', header=None)
df.head()
                                                   0          1  2  3  4   5   6
0  n/holyscratch01/enos_lab/jbrown613/data/knn_10...  AK-709502  i  d  0  \N  \N
1                                          AK-787334  AK-706032  i  r  0  \N  \N
2                                          AK-787334  AK-647339  i  r  0  \N  \N
3                                          AK-787334  AK-618324  i  i  0  \N  \N
4                                          AK-787334  DC-567085  i  i  0  \N  \N

Compared to this when reading in the unzipped file:

df = pd.read_csv('knn_1000_AK1_2012.csv', sep='\t',dtype='unicode',index_col=None, low_memory='true',header=None)
df.head()
           0          1  2  3  4   5   6
0  AK-787334  AK-709502  i  d  0  \N  \N
1  AK-787334  AK-706032  i  r  0  \N  \N
2  AK-787334  AK-647339  i  r  0  \N  \N
3  AK-787334  AK-618324  i  i  0  \N  \N
4  AK-787334  DC-567085  i  i  0  \N  \N

On Sep 15, 2020, at 11:14 AM, dkakkar <notifications@github.com> wrote:

I think it is a memory issue. Please divide the file into smaller pieces, try again, and let's see what happens.

On Tue, Sep 15, 2020 at 11:11 AM Jacob Brown <notifications@github.com> wrote:

Okay, thanks Devika. This might solve one issue, but recall that last night the process also died while reading one of the smaller tables (RI) into OmniSci, that is, after it had loaded successfully into the Python environment.

On Sep 15, 2020, at 11:09 AM, dkakkar <notifications@github.com> wrote:

Then your dataframe is running out of memory because the file is too big to read all at once. Please read it in chunks; look into the chunksize option of pandas read_csv and modify the script accordingly:

pd.read_csv(filename, chunksize=chunksize)
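With `chunksize` set, `pd.read_csv` returns an iterator of DataFrames rather than one DataFrame, so the script has to loop over it. A minimal sketch of that loop follows; the helper name `load_in_chunks` and the `load_chunk` callback are assumptions made for illustration.

```python
import io

import pandas as pd


def load_in_chunks(source, load_chunk, chunksize=100_000, **read_csv_kwargs):
    """Stream a delimited file through load_chunk one DataFrame at a time.

    Returns the total number of rows seen.
    """
    total = 0
    for chunk in pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs):
        load_chunk(chunk)  # only one chunk is held in memory at a time
        total += len(chunk)
    return total


# Toy usage: an in-memory file and a loader that only records chunk sizes.
sizes = []
rows = load_in_chunks(
    io.StringIO("a\tb\n1\t2\n3\t4\n5\t6\n"),
    lambda chunk: sizes.append(len(chunk)),
    chunksize=2,
    sep="\t",
)
print(rows, sizes)
```

In the real script, `load_chunk` could be something like `lambda chunk: conn.load_table_columnar("knn", chunk, preserve_index=False)`, assuming an open pymapd connection `conn` and an existing `knn` table.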


dkakkar commented 4 years ago

Here is the code used to make the table:

conn.execute("Create table IF NOT EXISTS knn (source_id TEXT ENCODING NONE, neighbor_id TEXT ENCODING NONE, dist FLOAT, dpost FLOAT, rpost FLOAT);")
conn.load_table_columnar("knn", df, preserve_index=False)

On Sep 15, 2020, at 3:53 PM, Devika Kakkar devikakakkar29@gmail.com wrote:

What is the data type for source_id?


dkakkar commented 4 years ago

Please use TEXT ENCODING DICT wherever you define it.
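Applied to the table definition quoted earlier in the thread, the change might look like the sketch below. The column list is copied from that definition; the wrapper function `create_knn_table` is hypothetical, and `conn` is assumed to be an open pymapd connection.

```python
# Same columns as the original definition, with the two id columns
# dictionary-encoded instead of ENCODING NONE.
KNN_DDL = (
    "CREATE TABLE IF NOT EXISTS knn ("
    "source_id TEXT ENCODING DICT, "
    "neighbor_id TEXT ENCODING DICT, "
    "dist FLOAT, dpost FLOAT, rpost FLOAT);"
)


def create_knn_table(conn):
    """Create the knn table with dictionary-encoded id columns.

    conn is assumed to expose execute(sql), as a pymapd connection does.
    """
    conn.execute(KNN_DDL)
```

Because of `IF NOT EXISTS`, a `knn` table that was already created with `ENCODING NONE` would have to be dropped first for the new definition to take effect.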

On Tue, Sep 15, 2020 at 3:55 PM Jacob Brown jrbrown@g.harvard.edu wrote:

The data type for source_id is STR


jakerbrown commented 4 years ago

Thanks Devika,

That seems to fix those issues. I think the remaining problems are the potential memory issue, which I can solve by outputting smaller files, and an error when joining in SQL/OmniSci. I am running up against a unique constraint error that I do not understand. The rpost/dpost data frame that I am joining to the knn output will have multiple matches, since I am joining it on neighbor_id and people sometimes share neighbors. There are no duplicates in the rpost/dpost data frame, as it contains one row for each registered voter (or each potential neighbor, if you will). This kind of merge/join would not be a problem with the equivalent functions in Python/R, but it seems to hit a join difficulty in SQL. Can you clarify what is going on?

conn.execute("Create table temp as (SELECT * FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);")
Traceback (most recent call last):
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
    at_most_n=-1,
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
    return self.recv_sql_execute()
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
    raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Sqlite3 Error: UNIQUE constraint failed: mapd_columns.tableid, mapd_columns.name')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "", line 1, in
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
    return c.execute(operation, parameters=parameters)
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
    raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Sqlite3 Error: UNIQUE constraint failed: mapd_columns.tableid, mapd_columns.name
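One possible reading of this error, offered as an assumption rather than a confirmed diagnosis: the constraint named in the message is on (`tableid`, `name`) in the column catalog, and `SELECT *` over this join yields `neighbor_id` twice (once from `knn`, once from `mrg`), so `temp` would end up with two columns of the same name. Under that reading the multiple matches are not the problem. A sketch that builds the statement with an explicit column list follows; the helper `build_join_ctas` and the column names passed to it are hypothetical.

```python
def build_join_ctas(target, left, left_cols, right, right_cols, key):
    """Build CREATE TABLE ... AS SELECT with each output column name used once."""
    select = [f"{left}.{c}" for c in left_cols]
    # Take from the right table only the columns the left list does not already name.
    select += [f"{right}.{c}" for c in right_cols if c not in left_cols]
    return (
        f"CREATE TABLE {target} AS (SELECT {', '.join(select)} "
        f"FROM {left} LEFT JOIN {right} ON {left}.{key} = {right}.{key});"
    )


# Hypothetical column lists: ids and distance from knn, scores from mrg.
sql = build_join_ctas(
    "temp", "knn", ["source_id", "neighbor_id", "dist"],
    "mrg", ["neighbor_id", "dpost", "rpost"], "neighbor_id",
)
print(sql)
```

The resulting string could then be passed to `conn.execute(...)` in place of the `SELECT *` version.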

On Sep 15, 2020, at 3:57 PM, dkakkar notifications@github.com wrote:

Please use TEXT ENCODING DICT wherever you define it.

On Tue, Sep 15, 2020 at 3:55 PM Jacob Brown jrbrown@g.harvard.edu wrote:

The data type for source_id is STR

On Sep 15, 2020, at 3:53 PM, Devika Kakkar devikakakkar29@gmail.com wrote:

What is the data type for source_id?

On Tue, Sep 15, 2020 at 3:52 PM Jacob Brown notifications@github.com wrote:

Hi Devika,

So I have figured out how to handle reading in the zipped files, and I have been able to read in some of the smaller files to both Python and OmniSci. The issues I am running into now involve running the modeling code you provided, as am getting errors related to grouping on string columns. You can see that output below:

conn.execute("Create table results as (SELECT source_id, AVG(dpost) as mean_d_post, AVG(rpost) as mean_r_post, SUM(dpost 1/(1+dist))/SUM(1/(1+dist)) as wtd_d_post, SUM(rpost 1/(1+dist))/SUM(1/(1+dist)) as wtd_r_post FROM knn GROUP BY source_id);") Traceback (most recent call last): File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute at_most_n=-1, File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute return self.recv_sql_execute() File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute raise result.e omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Cannot group by string columns which are not dictionary encoded.')

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "", line 1, in File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute return c.execute(operation, parameters=parameters) File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute raise _translate_exception(e) from e pymapd.exceptions.Error: Exception: Cannot group by string columns which are not dictionary encoded.

I also got an error that I could not join tables using TEXT type variables in OmniSci. This occurred when I was trying to merge in the new rpost and dpost values:

conn.execute("Create table temp as (SELECT * FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);")
Traceback (most recent call last):
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
    at_most_n=-1,
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
    return self.recv_sql_execute()
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
    raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Projection type TEXT not supported for outer joins yet')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
    return c.execute(operation, parameters=parameters)
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
    raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Projection type TEXT not supported for outer joins yet

On Sep 15, 2020, at 2:44 PM, dkakkar notifications@github.com wrote:

Sure, take your time.

On Tue, Sep 15, 2020 at 1:51 PM Jacob Brown notifications@github.com wrote:

Hi Devika,

You can disregard my last email. I am still troubleshooting some things and will give a full report in a few hours.

Thanks,

Jake

On Sep 15, 2020, at 1:11 PM, dkakkar notifications@github.com wrote:

Yes.

On Tue, Sep 15, 2020 at 1:09 PM Jacob Brown < notifications@github.com> wrote:

Thanks, I'll look into this. Would another potential solution be to compress the file so that it has only the .gz extension?

On Sep 15, 2020, at 1:02 PM, dkakkar notifications@github.com wrote:

You are reading a .tar.gz compressed file, but in your read_csv call you specify gzip compression. This is causing the problem. Could you look into how to read a .tar.gz archive into a dataframe?
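For reference, here is a minimal sketch of one way to read a CSV out of a .tar.gz archive with the standard-library tarfile module. With compression='gzip' only the gzip layer is removed, so the tar header (which stores the member's path) gets parsed as the start of the first row; that would explain the stray path in the first cell. The helper name read_rows_from_tar_gz and the demo file names and rows are made up for illustration and are not part of the original script.

```python
import csv
import io
import os
import tarfile
import tempfile


def read_rows_from_tar_gz(archive_path, suffix=".csv", delimiter="\t"):
    """Return the rows of the first regular file in the archive whose name ends with `suffix`."""
    with tarfile.open(archive_path, mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(suffix):
                raw = tar.extractfile(member)  # binary stream of just this member
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                return list(csv.reader(text, delimiter=delimiter))
    raise FileNotFoundError("no member ending in %r in %s" % (suffix, archive_path))


# Demo on a throwaway archive (hypothetical file names and rows).
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "knn_demo.csv")
    with open(csv_path, "w", newline="") as fh:
        fh.write("AK-787334\tAK-709502\ti\td\t0\nAK-787334\tAK-706032\ti\tr\t0\n")
    archive_path = os.path.join(tmp, "knn_demo.tar.gz")
    with tarfile.open(archive_path, mode="w:gz") as tar:
        tar.add(csv_path, arcname="data/knn_demo.csv")
    rows = read_rows_from_tar_gz(archive_path)

print(rows[0])
```

To get a dataframe instead of a list of rows, the stream returned by tar.extractfile(member) can be passed to pd.read_csv in place of the filename, with no compression argument.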

On Tue, Sep 15, 2020 at 12:57 PM Jacob Brown < notifications@github.com> wrote:

Hi Devika,

After looking at this more, one of the issues might have to do with how the file is being read into Python. When I read the tarred file directly into Python, there is a weird value in the first row of the first column. This does not occur if I first unzip the file and then load the .csv into Python. Why might this be happening? See below:

df = pd.read_csv('knn_1000_AK1_2012.tar.gz', sep='\t',dtype='unicode',index_col=None, low_memory='true',compression='gzip', header=None)
df.head()
                                                   0          1  2  3  4   5   6
0  n/holyscratch01/enos_lab/jbrown613/data/knn_10...  AK-709502  i  d  0  \N  \N
1                                          AK-787334  AK-706032  i  r  0  \N  \N
2                                          AK-787334  AK-647339  i  r  0  \N  \N
3                                          AK-787334  AK-618324  i  i  0  \N  \N
4                                          AK-787334  DC-567085  i  i  0  \N  \N

Compared to this when reading in the unzipped file:

df = pd.read_csv('knn_1000_AK1_2012.csv', sep='\t',dtype='unicode',index_col=None, low_memory='true',header=None)
df.head()
           0          1  2  3  4   5   6
0  AK-787334  AK-709502  i  d  0  \N  \N
1  AK-787334  AK-706032  i  r  0  \N  \N
2  AK-787334  AK-647339  i  r  0  \N  \N
3  AK-787334  AK-618324  i  i  0  \N  \N
4  AK-787334  DC-567085  i  i  0  \N  \N

On Sep 15, 2020, at 11:14 AM, dkakkar < notifications@github.com> wrote:

I think it is a memory issue. Please divide the file into smaller pieces and try again, and let's see what happens.

On Tue, Sep 15, 2020 at 11:11 AM Jacob Brown < notifications@github.com> wrote:

Okay, thanks Devika. This might solve one issue, but also recall that last night the process died while reading one of the smaller tables (RI) into OmniSci, that is, after it had loaded successfully into the Python environment.

On Sep 15, 2020, at 11:09 AM, dkakkar < notifications@github.com> wrote:

Then you are running out of memory because the file is too big to read into a dataframe all at once. Please read it in chunks. Look into the chunksize option of pandas read_csv and modify the script accordingly:

pd.read_csv(filename, chunksize=chunksize)
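As a rough sketch of what chunked reading could look like: with chunksize set, pd.read_csv returns an iterator of dataframes instead of one large dataframe, so only one chunk is held in memory at a time. The in-memory sample data and the helper name iter_chunks are hypothetical; a real run would pass the file path and a much larger chunksize.

```python
import io

import pandas as pd


def iter_chunks(source, chunksize):
    """Yield dataframes of at most `chunksize` rows from a tab-separated source."""
    reader = pd.read_csv(source, sep="\t", header=None, dtype=str, chunksize=chunksize)
    for chunk in reader:
        # Each chunk is an ordinary DataFrame and could be loaded into the
        # database here (for example with the conn.load_table call used in the
        # script from this thread) before the next chunk is read.
        yield chunk


# Made-up sample rows standing in for a large file.
sample = io.StringIO("AK-1\tAK-2\t0.5\nAK-1\tAK-3\t1.5\nAK-2\tAK-3\t2.0\n")
sizes = [len(chunk) for chunk in iter_chunks(sample, chunksize=2)]
print(sizes)  # [2, 1]
```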


dkakkar commented 4 years ago

Please send me the column names for both tables.

On Tue, Sep 15, 2020, 7:22 PM Jacob Brown notifications@github.com wrote:

Thanks Devika,

That seems to fix those issues. I think the remaining issues are the potential memory issue, which I can solve by outputting smaller files, and an issue when joining in SQL/OmniSci. I am running up against a unique constraint error that I do not understand. The rpost/dpost data frame that I am joining to the knn output will have multiple matches, since I am joining it on neighbor_id, and sometimes people share neighbors. There are no duplicates in the rpost/dpost data frame, as it contains one row for each registered voter (or each potential neighbor, if you will). This kind of merge/join would not be a problem using similar functions in Python/R, but it seems to run up against a join difficulty in SQL. Can you clarify what is going on?

conn.execute("Create table temp as (SELECT * FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);")
Traceback (most recent call last):
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
    at_most_n=-1,
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
    return self.recv_sql_execute()
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
    raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Sqlite3 Error: UNIQUE constraint failed: mapd_columns.tableid, mapd_columns.name')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
    return c.execute(operation, parameters=parameters)
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
    raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Sqlite3 Error: UNIQUE constraint failed: mapd_columns.tableid, mapd_columns.name

On Sep 15, 2020, at 3:57 PM, dkakkar notifications@github.com wrote:

Please use TEXT ENCODING DICT wherever you define it.
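To illustrate the suggestion, here is a small sketch that builds a CREATE TABLE statement in which the string ID columns are declared as TEXT ENCODING DICT, the dictionary-encoded type that the GROUP BY error message asks for. The helper create_table_sql is hypothetical, and the DOUBLE type for dist is an assumption; the column names come from the queries in this thread.

```python
def create_table_sql(table, columns):
    """Build a CREATE TABLE statement from (column name, SQL type) pairs."""
    cols = ", ".join("%s %s" % (name, sqltype) for name, sqltype in columns)
    return "CREATE TABLE %s (%s);" % (table, cols)


knn_ddl = create_table_sql(
    "knn",
    [
        ("source_id", "TEXT ENCODING DICT"),    # dictionary-encoded string
        ("neighbor_id", "TEXT ENCODING DICT"),  # dictionary-encoded string
        ("dist", "DOUBLE"),                     # assumed numeric type
    ],
)
print(knn_ddl)
```

The resulting string could then be passed to conn.execute before the data is loaded into the table, so that the column types are set explicitly rather than inferred.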


jakerbrown commented 4 years ago

Knn: source_id neighbor_id dist

mrg: dpost rpost neighbor_id


dkakkar commented 4 years ago

Try:

Create table temp as (SELECT a.source_id, a.neighbor_id,a.dist, b.dpost, b.rpost FROM knn a LEFT JOIN mrg b ON a.neighbor_id = b.neighbor_id);
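A plausible reading of the UNIQUE constraint error, judging from its mention of mapd_columns.tableid and mapd_columns.name, is that SELECT * brings in neighbor_id from both knn and mrg, so the new table would end up with two columns of the same name. The sketch below reproduces the column clash using Python's built-in sqlite3 module as a stand-in; it is not OmniSci, and the sample rows are made up. It also shows that a shared neighbor matching several knn rows is not a problem for the join itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE knn (source_id TEXT, neighbor_id TEXT, dist REAL);
    CREATE TABLE mrg (dpost REAL, rpost REAL, neighbor_id TEXT);
    INSERT INTO knn VALUES ('A', 'C', 1.0), ('B', 'C', 2.0), ('A', 'D', 3.0);
    INSERT INTO mrg VALUES (0.7, 0.3, 'C');
""")

# SELECT * returns neighbor_id twice, once from each table.
star = conn.execute(
    "SELECT * FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id"
)
star_columns = [d[0] for d in star.description]

# Naming each output column once avoids the clash.
explicit = conn.execute("""
    SELECT a.source_id, a.neighbor_id, a.dist, b.dpost, b.rpost
    FROM knn a LEFT JOIN mrg b ON a.neighbor_id = b.neighbor_id
    ORDER BY a.source_id, a.neighbor_id
""")
explicit_columns = [d[0] for d in explicit.description]
joined_rows = explicit.fetchall()

print(star_columns.count("neighbor_id"))  # 2
print(joined_rows)
```

Neighbor 'C' is shared by sources 'A' and 'B' and matches both rows, while 'D' has no match in mrg and gets NULLs, which is the usual LEFT JOIN behavior.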

On Tue, Sep 15, 2020 at 8:03 PM Jacob Brown notifications@github.com wrote:

Knn: source_id neighbor_id dist

mrg: dpost rpost neighbor_id

On Sep 15, 2020, at 7:37 PM, dkakkar notifications@github.com wrote:

Please send me the column names for both tables.

On Tue, Sep 15, 2020, 7:22 PM Jacob Brown notifications@github.com wrote:

Thanks Devika,

That seems to fix those issues. Two problems remain: the potential memory issue, which I can solve by outputting smaller files, and an error when joining in SQL/OmniSci. I am running up against a unique constraint error that I do not understand. The rpost/dpost data frame that I am joining to the knn output will have multiple matches, since I am joining on neighbor_id and people sometimes share neighbors. There are no duplicates in the rpost/dpost data frame itself, as it contains one row for each registered voter (or each potential neighbor, if you will). This kind of merge/join would not be a problem with the equivalent functions in Python or R, but it seems to hit a join difficulty in SQL. Can you clarify what is going on?

conn.execute("Create table temp as (SELECT * FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);")
Traceback (most recent call last):
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
    at_most_n=-1,
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
    return self.recv_sql_execute()
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
    raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Sqlite3 Error: UNIQUE constraint failed: mapd_columns.tableid, mapd_columns.name')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
    return c.execute(operation, parameters=parameters)
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
    raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Sqlite3 Error: UNIQUE constraint failed: mapd_columns.tableid, mapd_columns.name

On Sep 15, 2020, at 3:57 PM, dkakkar notifications@github.com wrote:

Please use TEXT ENCODING DICT wherever you define it.
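As a sketch of what that advice could look like in a table definition: the helper below only builds the `CREATE TABLE` text and does not talk to OmniSci. The helper name `dict_encoded_ddl` is made up for illustration, and `DOUBLE` for `dist`, `dpost` and `rpost` is an assumption, since the thread only states that the ID columns are strings.

```python
def dict_encoded_ddl(table, text_cols, numeric_cols):
    """Build a CREATE TABLE statement whose string columns are dictionary encoded."""
    cols = [f"{name} TEXT ENCODING DICT" for name in text_cols]
    # DOUBLE is an assumed type for the numeric columns; adjust to the real data.
    cols += [f"{name} DOUBLE" for name in numeric_cols]
    return f"CREATE TABLE {table} ({', '.join(cols)});"

knn_ddl = dict_encoded_ddl("knn", ["source_id", "neighbor_id"], ["dist"])
mrg_ddl = dict_encoded_ddl("mrg", ["neighbor_id"], ["dpost", "rpost"])
print(knn_ddl)
print(mrg_ddl)
```

Each string could then be passed to `conn.execute(...)` before the data is loaded, so that later `GROUP BY` and join queries see dictionary-encoded columns.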

On Tue, Sep 15, 2020 at 3:55 PM Jacob Brown jrbrown@g.harvard.edu wrote:

The data type for source_id is STR

On Sep 15, 2020, at 3:53 PM, Devika Kakkar < devikakakkar29@gmail.com> wrote:

What is the data type for source_id?

On Tue, Sep 15, 2020 at 3:52 PM Jacob Brown < notifications@github.com> wrote:

Hi Devika,

So I have figured out how to handle reading in the zipped files, and I have been able to read some of the smaller files into both Python and OmniSci. The issues I am running into now involve running the modeling code you provided, as I am getting errors related to grouping on string columns. You can see that output below:

conn.execute("Create table results as (SELECT source_id, AVG(dpost) as mean_d_post, AVG(rpost) as mean_r_post, SUM(dpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_d_post, SUM(rpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_r_post FROM knn GROUP BY source_id);")
Traceback (most recent call last):
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
    at_most_n=-1,
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
    return self.recv_sql_execute()
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
    raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Cannot group by string columns which are not dictionary encoded.')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
    return c.execute(operation, parameters=parameters)
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
    raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Cannot group by string columns which are not dictionary encoded.

I also got an error that I could not join tables using TEXT type variables in OmniSci. This occurred when I was trying to merge in the new rpost and dpost values:

conn.execute("Create table temp as (SELECT * FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);")
Traceback (most recent call last):
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
    at_most_n=-1,
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
    return self.recv_sql_execute()
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
    raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Projection type TEXT not supported for outer joins yet')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
    return c.execute(operation, parameters=parameters)
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
    raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Projection type TEXT not supported for outer joins yet
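For reference, the weighted columns in the `results` query above are inverse-distance-weighted means: each neighbor's value is weighted by 1/(1+dist). Below is a small check of that arithmetic, using Python's built-in `sqlite3` as a stand-in for OmniSci and invented numbers. It assumes `dpost` already sits in the same table as `dist`, as the query does.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE knn (source_id TEXT, dist REAL, dpost REAL)")
# Three invented neighbors of one voter: only the nearest one has dpost = 1.
con.executemany(
    "INSERT INTO knn VALUES (?, ?, ?)",
    [("AK-1", 0.0, 1.0), ("AK-1", 1.0, 0.0), ("AK-1", 3.0, 0.0)],
)
row = con.execute(
    """
    SELECT source_id,
           AVG(dpost) AS mean_d_post,
           SUM(dpost * 1.0 / (1 + dist)) / SUM(1.0 / (1 + dist)) AS wtd_d_post
    FROM knn
    GROUP BY source_id
    """
).fetchone()
con.close()
print(row)
```

The weights here are 1, 0.5 and 0.25, so the weighted value is 1/1.75 (about 0.571), higher than the plain mean of 1/3 because the nearest neighbor counts most.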

On Sep 15, 2020, at 2:44 PM, dkakkar notifications@github.com wrote:

Sure, take your time.

On Tue, Sep 15, 2020 at 1:51 PM Jacob Brown < notifications@github.com> wrote:

Hi Devika,

You can disregard my last email. I am still troubleshooting some things, and I'll give a full report in a few hours.

Thanks,

Jake

On Sep 15, 2020, at 1:11 PM, dkakkar < notifications@github.com> wrote:

Yes.

On Tue, Sep 15, 2020 at 1:09 PM Jacob Brown < notifications@github.com> wrote:

Thanks, I'll look into this. Would another potential solution be to gzip the file directly, so that it only has the extension .gz?

On Sep 15, 2020, at 1:02 PM, dkakkar < notifications@github.com> wrote:

You are reading a .tar.gz file, but your read_csv call specifies plain gzip compression (compression='gzip'). This is causing the problem. Could you look into how to read a .tar.gz archive into a dataframe?
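The stray path-like value reported in the first cell fits this explanation: `compression='gzip'` undoes only the gzip layer, so the parser sees the tar archive itself, whose first bytes are a header containing the member's file name. The sketch below builds a tiny `.tar.gz` in memory (the member name, rows and function names are made up) and compares the two ways of reading it using only the standard library.

```python
import gzip
import io
import tarfile

def make_tar_gz(member_name, payload):
    """Build a small .tar.gz archive in memory (stand-in for the real file)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz", format=tarfile.USTAR_FORMAT) as tar:
        info = tarfile.TarInfo(name=member_name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()

def first_line_gzip_only(archive):
    """Undo only the gzip layer; the tar header still precedes the data."""
    return gzip.decompress(archive).split(b"\n", 1)[0]

def first_line_via_tarfile(archive):
    """Open the archive as a tar file and read the first line of its first member."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        member = tar.getmembers()[0]
        handle = tar.extractfile(member)
        return handle.readline().rstrip(b"\n")

rows = b"AK-787334\tAK-709502\ti\td\t0\nAK-787334\tAK-706032\ti\tr\t0\n"
archive = make_tar_gz("data/knn_sample.csv", rows)

gzip_only = first_line_gzip_only(archive)      # starts with the member name
via_tarfile = first_line_via_tarfile(archive)  # clean first data row
print(gzip_only[:19])
print(via_tarfile)
```

The file object returned by `extractfile` could be handed to `pd.read_csv` in place of a path, which avoids unpacking the archive to disk first.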

On Tue, Sep 15, 2020 at 12:57 PM Jacob Brown < notifications@github.com> wrote:

Hi Devika,

After looking at this more, one of the issues might have to do with how the file is being read into Python. When I read the tarred file directly into Python, there is a weird value in the first row of the first column. This does not occur if I first unzip the file and then load the .csv into Python. Why might this be happening? See below:

df = pd.read_csv('knn_1000_AK1_2012.tar.gz', sep='\t',dtype='unicode',index_col=None, low_memory='true',compression='gzip', header=None)
df.head()
                                                   0          1  2  3  4   5   6
0  n/holyscratch01/enos_lab/jbrown613/data/knn_10...  AK-709502  i  d  0  \N  \N
1                                          AK-787334  AK-706032  i  r  0  \N  \N
2                                          AK-787334  AK-647339  i  r  0  \N  \N
3                                          AK-787334  AK-618324  i  i  0  \N  \N
4                                          AK-787334  DC-567085  i  i  0  \N  \N

Compared to this when reading in the unzipped file:

df = pd.read_csv('knn_1000_AK1_2012.csv', sep='\t',dtype='unicode',index_col=None, low_memory='true',header=None)
df.head()
           0          1  2  3  4   5   6
0  AK-787334  AK-709502  i  d  0  \N  \N
1  AK-787334  AK-706032  i  r  0  \N  \N
2  AK-787334  AK-647339  i  r  0  \N  \N
3  AK-787334  AK-618324  i  i  0  \N  \N
4  AK-787334  DC-567085  i  i  0  \N  \N

On Sep 15, 2020, at 11:14 AM, dkakkar < notifications@github.com> wrote:

I think it is a memory issue. Please split the file into smaller pieces and try again, and let's see what happens.

On Tue, Sep 15, 2020 at 11:11 AM Jacob Brown < notifications@github.com> wrote:

Okay, thanks Devika. This might solve one issue, but also recall that last night the process died while reading one of the smaller tables (RI) into OmniSci, that is, after it had loaded successfully into the Python environment.

On Sep 15, 2020, at 11:09 AM, dkakkar < notifications@github.com> wrote:

Then Python is running out of memory because the file is too big to read into a dataframe all at once. Please read it in chunks. Look into the chunksize option of pandas read_csv and modify the script accordingly:

pd.read_csv(filename, chunksize=chunksize)
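A minimal sketch of the chunked pattern follows. It is written with the standard library so it runs without pandas or an OmniSci server. The helper `iter_chunks` and the sample data are made up, and the pandas loop shown in the comment is the form the script would more likely use.

```python
import csv
import io
from itertools import islice

def iter_chunks(handle, chunk_rows):
    """Yield lists of at most chunk_rows parsed rows from a tab-separated file."""
    reader = csv.reader(handle, delimiter="\t")
    while True:
        chunk = list(islice(reader, chunk_rows))
        if not chunk:
            return
        yield chunk

# pandas form of the same loop (not run here; assumes an open pymapd connection
# `conn` and that repeated load_table calls append to the same table):
#   for chunk in pd.read_csv(path, sep="\t", header=None, chunksize=chunk_rows):
#       conn.load_table("knn", chunk)

sample = io.StringIO("AK-787334\tAK-709502\t0\n" * 5)  # stand-in for the real file
loaded = []
for chunk in iter_chunks(sample, chunk_rows=2):
    # A real script would load the chunk into OmniSci here instead of counting it.
    loaded.append(len(chunk))
print(loaded)
```

Only one chunk is held in memory at a time, so peak memory depends on the chunk size and not on the file size.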


dkakkar commented 4 years ago

Did it work?


jakerbrown commented 4 years ago

Seems to work right now yes. Thank you!

On Sep 16, 2020, at 1:17 PM, dkakkar notifications@github.com wrote:

Did it work?

On Tue, Sep 15, 2020 at 8:08 PM Devika Kakkar devikakakkar29@gmail.com wrote:

Try:

Create table temp as (SELECT a.source_id, a.neighbor_id,a.dist, b.dpost, b.rpost FROM knn a LEFT JOIN mrg b ON a.neighbor_id = b.neighbor_id);

On Tue, Sep 15, 2020 at 8:03 PM Jacob Brown notifications@github.com wrote:

Knn: source_id neighbor_id dist

mrg: dpost rpost neighbor_id

On Sep 15, 2020, at 7:37 PM, dkakkar notifications@github.com wrote:

Pls send me column bames for both tables.

On Tue, Sep 15, 2020, 7:22 PM Jacob Brown notifications@github.com wrote:

Thanks Devika,

That seems to fix those issues. I think the remaining issue is the potential memory issue, which I can solve by outputting smaller files, and an issue when joining in sql/Omnisci. I am running up against a unique constraint error that I do not understand. The rpost/dpost data frame that I am joining to the knn output will have multiple matches, since I am joining it to neighbor_id, and sometimes people share neighbors. There are no duplicates in the rpost/dpost data frame, as it contains one row for each registered voter (or each potential neighbor, if you will). This kind of merge/join would not be a problem using similar functions in python/R, but seems to run up against a join difficulty in sql. Can you clarify what is going on?

conn.execute("Create table temp as (SELECT * FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);") Traceback (most recent call last): File

"/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute at_most_n=-1, File

"/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute return self.recv_sql_execute() File

"/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute raise result.e omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Sqlite3 Error: UNIQUE constraint failed: mapd_columns.tableid, mapd_columns.name')

The above exception was the direct cause of the following exception:

[jbrown613@boslogin04 ~]$ File "", line 1, in File

"/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute return c.execute(operation, parameters=parameters) File

"/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute raise _translate_exception(e) from e pymapd.exceptions.Error: Exception: Sqlite3 Error: UNIQUE constraint failed: mapd_columns.tableid, mapd_columns.name

On Sep 15, 2020, at 3:57 PM, dkakkar notifications@github.com wrote:

Please use TEXT ENCODING DICT wherever you define it.

On Tue, Sep 15, 2020 at 3:55 PM Jacob Brown jrbrown@g.harvard.edu wrote:

The data type for source_id is STR

On Sep 15, 2020, at 3:53 PM, Devika Kakkar < devikakakkar29@gmail.com> wrote:

What is the data type for source_id?

On Tue, Sep 15, 2020 at 3:52 PM Jacob Brown < notifications@github.com> wrote:

Hi Devika,

So I have figured out how to handle reading in the zipped files, and I have been able to read in some of the smaller files to both Python and OmniSci. The issues I am running into now involve running the modeling code you provided, as am getting errors related to grouping on string columns. You can see that output below:

conn.execute("Create table results as (SELECT source_id, AVG(dpost) as mean_d_post, AVG(rpost) as mean_r_post, SUM(dpost 1/(1+dist))/SUM(1/(1+dist)) as wtd_d_post, SUM(rpost 1/(1+dist))/SUM(1/(1+dist)) as wtd_r_post FROM knn GROUP BY source_id);") Traceback (most recent call last): File

"/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py",

line 118, in execute at_most_n=-1, File

"/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py",

line 1755, in sql_execute return self.recv_sql_execute() File

"/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py",

line 1784, in recv_sql_execute raise result.e omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Cannot group by string columns which are not dictionary encoded.')

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "", line 1, in File

"/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py",

line 390, in execute return c.execute(operation, parameters=parameters) File

"/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py",

line 121, in execute raise _translate_exception(e) from e pymapd.exceptions.Error: Exception: Cannot group by string columns which are not dictionary encoded.

I also got an error that I could not join tables using TEXT type variables in OmniSci. This occurred when I was trying to merge in the new rpost and dpost values:

conn.execute("Create table temp as (SELECT * FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);") Traceback (most recent call last): File

"/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py",

line 118, in execute at_most_n=-1, File

"/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py",

line 1755, in sql_execute return self.recv_sql_execute() File

"/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py",

line 1784, in recv_sql_execute raise result.e omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Projection type TEXT not supported for outer joins yet')

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "", line 1, in File

"/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py",

line 390, in execute return c.execute(operation, parameters=parameters) File

"/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py",

line 121, in execute raise _translate_exception(e) from e pymapd.exceptions.Error: Exception: Projection type TEXT not supported for outer joins yet

On Sep 15, 2020, at 2:44 PM, dkakkar <notifications@github.com> wrote:

Sure, take your time.

On Tue, Sep 15, 2020 at 1:51 PM Jacob Brown < notifications@github.com> wrote:

Hi Devika,

You can disregard my last email. I am still troubleshooting some things and will give a full report in a few hours.

Thanks,

Jake

On Sep 15, 2020, at 1:11 PM, dkakkar < notifications@github.com> wrote:

Yes.

On Tue, Sep 15, 2020 at 1:09 PM Jacob Brown < notifications@github.com> wrote:

Thanks, I'll look into this. Would another potential solution be to compress the file with gzip alone, so that it only has the .gz extension?

On Sep 15, 2020, at 1:02 PM, dkakkar < notifications@github.com> wrote:

You are reading a .tar.gz compressed file, but in your DataFrame read_csv call you specify gzip (.gz) compression. This is causing the problem. Could you look into how to read a .tar.gz archive into a DataFrame?
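A minimal sketch of one way to handle this, using only the standard library's tarfile module. It also shows why plain gzip decompression yields a stray value in the first cell: a .tar.gz file is a tar archive inside a gzip stream, and the tar header begins with the member's path. The helper names, the archive name demo.tar.gz, and the sample row are hypothetical and not taken from the original script.

```python
import gzip
import io
import os
import tarfile
import tempfile


def read_first_member(archive_path):
    """Return the bytes of the first regular file inside a .tar.gz archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if member.isfile():
                return archive.extractfile(member).read()
    raise ValueError("no regular file found in " + archive_path)


def build_demo_archive(directory):
    """Create a tiny stand-in archive (hypothetical file name and row)."""
    payload = b"AK-787334\tAK-709502\ti\td\t0\n"
    archive_path = os.path.join(directory, "demo.tar.gz")
    info = tarfile.TarInfo(name="data/demo.csv")
    info.size = len(payload)
    with tarfile.open(archive_path, mode="w:gz") as archive:
        archive.addfile(info, io.BytesIO(payload))
    return archive_path, payload


with tempfile.TemporaryDirectory() as tmp:
    demo_path, demo_payload = build_demo_archive(tmp)
    # Plain gzip decompression exposes the tar header, which starts with the member path.
    with gzip.open(demo_path, "rb") as raw:
        header_start = raw.read(13)
    # Going through tarfile returns only the file contents.
    member_bytes = read_first_member(demo_path)
```

The returned bytes could then be passed to pandas, for example pd.read_csv(io.BytesIO(member_bytes), sep='\t', header=None). This sketch reads the whole member into memory, so it may not suit very large files.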

On Tue, Sep 15, 2020 at 12:57 PM Jacob Brown < notifications@github.com> wrote:

Hi Devika,

After looking at this more, one of the issues might have to do with how the file is being read into Python. When I read the tarred file directly into Python, there is a weird value at the intersection of the first row and first column. This does not occur if I first unzip the file and then load the .csv into Python. Why might this be happening? See below:

df = pd.read_csv('knn_1000_AK1_2012.tar.gz', sep='\t',dtype='unicode',index_col=None, low_memory='true',compression='gzip', header=None)
df.head()
                                                   0          1  2  3  4   5   6
0  n/holyscratch01/enos_lab/jbrown613/data/knn_10...  AK-709502  i  d  0  \N  \N
1                                          AK-787334  AK-706032  i  r  0  \N  \N
2                                          AK-787334  AK-647339  i  r  0  \N  \N
3                                          AK-787334  AK-618324  i  i  0  \N  \N
4                                          AK-787334  DC-567085  i  i  0  \N  \N

Compared to this when reading in the unzipped file:

df = pd.read_csv('knn_1000_AK1_2012.csv', sep='\t',dtype='unicode',index_col=None, low_memory='true',header=None)
df.head()
           0          1  2  3  4   5   6
0  AK-787334  AK-709502  i  d  0  \N  \N
1  AK-787334  AK-706032  i  r  0  \N  \N
2  AK-787334  AK-647339  i  r  0  \N  \N
3  AK-787334  AK-618324  i  i  0  \N  \N
4  AK-787334  DC-567085  i  i  0  \N  \N

On Sep 15, 2020, at 11:14 AM, dkakkar < notifications@github.com> wrote:

I think it is a memory issue. Please divide the file into smaller pieces and try again, and let's see what happens.

On Tue, Sep 15, 2020 at 11:11 AM Jacob Brown < notifications@github.com> wrote:

Okay, thanks Devika. This might solve one issue, but also recall that last night the process died while reading one of the smaller tables (RI) into OmniSci, that is, after it had been successfully loaded into the Python environment.

On Sep 15, 2020, at 11:09 AM, dkakkar < notifications@github.com> wrote:

Then you are running out of memory because the file is too big to read into a DataFrame all at once. Please read it in chunks. Look into the chunksize option of pandas read_csv and modify the script accordingly:

pd.read_csv(filename, chunksize=chunksize)
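As a minimal sketch of the chunked pattern: with chunksize set, read_csv returns an iterator of DataFrames rather than a single DataFrame, so each piece can be processed and released before the next is read. The three sample rows and the chunk size of 2 are hypothetical stand-ins for the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the large tab-separated KNN file.
sample = io.StringIO(
    "AK-787334\tAK-709502\t0.5\n"
    "AK-787334\tAK-706032\t1.5\n"
    "AK-787335\tAK-647339\t2.0\n"
)

row_count = 0
chunk_sizes = []
# Each iteration yields a DataFrame with at most `chunksize` rows.
for chunk in pd.read_csv(sample, sep="\t", header=None, chunksize=2):
    chunk_sizes.append(len(chunk))
    row_count += len(chunk)
    # Each chunk could be loaded into the database here before the next one is read.
```

Only one chunk is held in memory at a time, which is the point of the approach. The right chunk size depends on the available memory on the node.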


dkakkar commented 4 years ago

Just FYI, you were previously selecting all columns from both tables, and the final merged table cannot have two columns with the same name (neighbor_id), so it was throwing a UNIQUE constraint error.

On Wed, Sep 16, 2020 at 1:20 PM Jacob Brown notifications@github.com wrote:

Seems to work right now yes. Thank you!

On Sep 16, 2020, at 1:17 PM, dkakkar notifications@github.com wrote:

Did it work?

On Tue, Sep 15, 2020 at 8:08 PM Devika Kakkar devikakakkar29@gmail.com wrote:

Try:

Create table temp as (SELECT a.source_id, a.neighbor_id,a.dist, b.dpost, b.rpost FROM knn a LEFT JOIN mrg b ON a.neighbor_id = b.neighbor_id);
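A small sketch of the same explicit-column join, run against an in-memory SQLite database as a stand-in, since OmniSci needs a running server. SQLite is not OmniSci and handles duplicate column names differently, so this only illustrates the shape of the result. The sample rows and the table name joined are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE knn (source_id TEXT, neighbor_id TEXT, dist REAL);
CREATE TABLE mrg (dpost REAL, rpost REAL, neighbor_id TEXT);
INSERT INTO knn VALUES ('A', 'N1', 1.0), ('B', 'N1', 3.0), ('B', 'N2', 0.0);
INSERT INTO mrg VALUES (1.0, 0.0, 'N1');
""")

# Name each output column once, so neighbor_id is not selected from both tables.
conn.execute("""
CREATE TABLE joined AS
SELECT a.source_id, a.neighbor_id, a.dist, b.dpost, b.rpost
FROM knn a LEFT JOIN mrg b ON a.neighbor_id = b.neighbor_id
""")

cursor = conn.execute("SELECT * FROM joined ORDER BY source_id, neighbor_id")
column_names = [description[0] for description in cursor.description]
rows = cursor.fetchall()
```

Two knn rows share neighbor N1 and both pick up its dpost and rpost values, while N2 has no match and gets NULLs. Multiple matches on neighbor_id are therefore not the problem; the duplicate output column name was.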

On Tue, Sep 15, 2020 at 8:03 PM Jacob Brown notifications@github.com wrote:

Knn: source_id neighbor_id dist

mrg: dpost rpost neighbor_id

On Sep 15, 2020, at 7:37 PM, dkakkar notifications@github.com wrote:

Please send me the column names for both tables.

On Tue, Sep 15, 2020, 7:22 PM Jacob Brown <notifications@github.com> wrote:

Thanks Devika,

That seems to fix those issues. I think the remaining issue is the potential memory issue, which I can solve by outputting smaller files, and an issue when joining in sql/Omnisci. I am running up against a unique constraint error that I do not understand. The rpost/dpost data frame that I am joining to the knn output will have multiple matches, since I am joining it to neighbor_id, and sometimes people share neighbors. There are no duplicates in the rpost/dpost data frame, as it contains one row for each registered voter (or each potential neighbor, if you will). This kind of merge/join would not be a problem using similar functions in python/R, but seems to run up against a join difficulty in sql. Can you clarify what is going on?

conn.execute("Create table temp as (SELECT * FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);")
Traceback (most recent call last):
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
    at_most_n=-1,
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
    return self.recv_sql_execute()
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
    raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Sqlite3 Error: UNIQUE constraint failed: mapd_columns.tableid, mapd_columns.name')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "", line 1, in <module>
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
    return c.execute(operation, parameters=parameters)
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
    raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Sqlite3 Error: UNIQUE constraint failed: mapd_columns.tableid, mapd_columns.name

On Sep 15, 2020, at 3:57 PM, dkakkar notifications@github.com wrote:

Please use TEXT ENCODING DICT wherever you define it.
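As a rough sketch of what that could look like, the snippet below builds a CREATE TABLE statement that declares the ID columns as dictionary-encoded text. The helper build_create_table is hypothetical, and the DOUBLE type for dist is an assumption, since the thread does not state it. The statement is only built as a string here because running it needs a live OmniSci connection.

```python
def build_create_table(table_name, columns):
    """Return a CREATE TABLE statement from (column name, SQL type) pairs."""
    column_sql = ", ".join(name + " " + sql_type for name, sql_type in columns)
    return "CREATE TABLE " + table_name + " (" + column_sql + ");"


knn_columns = [
    ("source_id", "TEXT ENCODING DICT"),
    ("neighbor_id", "TEXT ENCODING DICT"),
    ("dist", "DOUBLE"),  # assumed numeric type; not stated in the thread
]
ddl = build_create_table("knn", knn_columns)

# With a live connection, and assuming the table is created before loading,
# this might be used along these lines:
#   conn.execute(ddl)
#   conn.load_table("knn", df, create=False)
```

Declaring the table up front, instead of letting the loader infer column types, is one way to make sure the string columns used in GROUP BY and joins are dictionary encoded.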

On Tue, Sep 15, 2020 at 3:55 PM Jacob Brown < jrbrown@g.harvard.edu> wrote:

The data type for source_id is STR

On Sep 15, 2020, at 3:53 PM, Devika Kakkar < devikakakkar29@gmail.com> wrote:

What is the data type for source_id?

On Tue, Sep 15, 2020 at 3:52 PM Jacob Brown < notifications@github.com> wrote:

Hi Devika,

So I have figured out how to handle reading in the zipped files, and I have been able to read some of the smaller files into both Python and OmniSci. The issues I am running into now involve running the modeling code you provided, as I am getting errors related to grouping on string columns. You can see that output below:

conn.execute("Create table results as (SELECT source_id, AVG(dpost) as mean_d_post, AVG(rpost) as mean_r_post, SUM(dpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_d_post, SUM(rpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_r_post FROM knn GROUP BY source_id);")
Traceback (most recent call last):
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
    at_most_n=-1,
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
    return self.recv_sql_execute()
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
    raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Cannot group by string columns which are not dictionary encoded.')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "", line 1, in <module>
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
    return c.execute(operation, parameters=parameters)
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
    raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Cannot group by string columns which are not dictionary encoded.

I also got an error that I could not join tables using TEXT type variables in OmniSci. This occurred when I was trying to merge in the new rpost and dpost values:

conn.execute("Create table temp as (SELECT * FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);")
Traceback (most recent call last):
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
    at_most_n=-1,
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
    return self.recv_sql_execute()
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
    raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Projection type TEXT not supported for outer joins yet')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "", line 1, in <module>
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
    return c.execute(operation, parameters=parameters)
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
    raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Projection type TEXT not supported for outer joins yet
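For reference, the wtd_d_post and wtd_r_post columns in the results query earlier in this message appear to be inverse-distance weighted means, with each neighbor weighted by 1/(1+dist). That reading assumes the operator between the value and the weight is multiplication. A small worked example in plain Python, with hypothetical values for one source_id:

```python
def inverse_distance_weighted_mean(values, distances):
    """Weighted mean of values, with weight 1 / (1 + dist) for each neighbor."""
    weights = [1.0 / (1.0 + d) for d in distances]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical neighbors of a single source_id.
dpost = [1.0, 0.0, 1.0]
dist = [0.0, 1.0, 3.0]

plain_mean = sum(dpost) / len(dpost)                         # analogue of AVG(dpost)
weighted_mean = inverse_distance_weighted_mean(dpost, dist)  # analogue of wtd_d_post
```

Here the weights are 1, 0.5 and 0.25, so the weighted mean is 1.25 / 1.75, or 5/7, compared with a plain mean of 2/3. Closer neighbors count for more.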

