Closed estyholt closed 5 years ago
Sorry, it didn't work right out of the gate.
I suspect you could get around it by replacing the line chromosomes_in_roman_num = not bool(len([s for s in seqname_set if s.isdigit()])
with chromosomes_in_roman_num = True
or chromosomes_in_roman_num=False
depending if the chromosomes are in roman numeral form or not. (I work mainly with yeast so I don't know the answer off the top of my head.) Of course, if it is true, that may lead to it breaking still. I'll look into more of a long-term solution in a few hours.
I changed it so that the chromosome designation column gets handled as if it a string even if it was reading in integers for the first column. Hopefully, it will work for you now.
When testing, I used the following command to make the GTF file smaller and then used that for testing:
head -199000 Homo_sapiens.GRCh38.88.chr.gtf > truncHomo_sapiens.GRCh38.88.chr.gtf
python plot_expression_across_chromosomes.py truncHomo_sapiens.GRCh38.88.chr.gtf data.tsv --columns 1,9,4 --exp_desig ko --base_desig in
As for speed, I don't think I can improve much the way the script currently reads input files. Pandas is actually reading that 1.4 Gb file and then the script was encountering an error almost immediately after that. In other words, Pandas .read_csv()
method reading the 2.6 million lines of Homo_sapiens.GRCh38.88.chr.gtf is the rate-limiting step there. In theory, if I imposed data types while reading that may speed it up. But GTF files are not always standard, and so I was trying to keep options open. I am going to test some options to see if I can come up with any improvement.
So I am seeing memory improvement with dtypes. In the code example, below first about a tenth of the human gtf file is read in normally and then with dtypes:
!head -266000 Homo_sapiens.GRCh38.88.chr.gtf > truncHomo_sapiens.GRCh38.88.chr.gtf
import pandas as pd
def mem_usage(pandas_obj):
'''
from https://www.dataquest.io/blog/pandas-big-data/
'''
if isinstance(pandas_obj,pd.DataFrame):
usage_b = pandas_obj.memory_usage(deep=True).sum()
else: # we assume if not a df it's a series
usage_b = pandas_obj.memory_usage(deep=True)
usage_mb = usage_b / 1024 ** 2 # convert bytes to megabytes
return "{:03.2f} MB".format(usage_mb)
col_names_to_apply = ("seqname", "source", "feature type", "start",
"end", "score", "strand", "frame", "attribute")
df = pd.read_csv(
"truncHomo_sapiens.GRCh38.88.chr.gtf", sep='\t', header=None, low_memory=False,
names=col_names_to_apply, comment='#')
print(mem_usage(df))
col_names_to_apply = ("seqname", "source", "feature type", "start",
"end", "score", "strand", "frame", "attribute")
column_types = {"seqname":'category',
"source":'category',
"feature type":'category',
"start":'int64',
"end":'int64',
"score":'category',
"strand":'category',
"frame":'category',
"attribute":'category',
}
df = pd.read_csv(
"truncHomo_sapiens.GRCh38.88.chr.gtf", sep='\t', header=None, low_memory=True,
names=col_names_to_apply, comment='#',dtype=column_types)
print(mem_usage(df))
Results:
225.55 MB
145.74 MB
However, where I was working at the time, I still couldn't even read in the full GTF. I need to test some more later when I can use a system where I can load the entire GTF and verify the speed improvement is there.
Thank you, now it's running and producing a plot. The only problem now is that the order of the chromosomes on the x axis is not sequential.
Yup, I was afraid of that.
However, I like that we made progress. And I am hopeful about this next hiccup since I think I had looked into controlling the order of the x-axis when I was working with roman numerals myself. (You'll note that chromosome number nine [designated by roman numeral IX] indicated is out of order in the current documentation plots.)
I'd definitely like to fix it for the case where chromosomes are designated by integers because that should be easily addressable. Much more so than roman numerals.
Can you tell me how soon it deviates from being sequential? I am going to see if I can reproduce similar output by generating mock data from that 1.4 Mb GTF file because I am trying to avoid asking you to send me some of your data. However, I worry data I generate that way may not recapitulate what you are seeing because I'll be using an essentially ordered list. My address is just my github user name at gmail.com if you had something you felt comfortable sending that produced the non-sequential outcome. Even if it is just the first column of something that produced the issue. I can always fill in mock values.
The other question I have is can you use what you have for now? Or did you need this yesterday and you'd take any kludgy trick I come up with to control the order ?
Backing up to the speed issue, I was trying on a different system and definitely convinced myself to add dtypes. I couldn't even read in that GTF file on a remote system without them. Turns out it isn't just speed. It just won't work under Amazon Linux. It crashes out without dtypes like discussed here.
Improvements were made yesterday to address the speed issue but I forgot to tag this issue in the commit. Reading the GTF and then compiling the necessary information have now been improved, see https://github.com/fomightez/sequencework/commit/2111c66fcf08e5849b8da0587416fdd9e0189136 .
After reading the GTF, I was building the dataframe a row at a time for each gene. That was tolerable with a 12 mb GTF from yeast, but not at all when scaled up to the 1.4 gb from human.
Fortunately the early steps of this script and the mock expression generator script were the same, and so I could improve this one while working on getting human data for addressing the order of the chromosomes along the x-axis. So now I have an idea of the problem you may be seeing. More on that soon.
So I suspect you were seeing something like this?
Or this? (Note that the X
chromosome is after seven and the eleventh is before the tenth.)
Fortunately, I stumbled upon the solution to that while addressing a new issue I had noticed after I added the dtypes. (I am going to explain it here for my own record, but feel free to skip to below to see outcome.) I was trying to address the issue that since adding the dtypes, the plotting code that worked with all chromosomes (see 2nd figure posted here) no longer worked with individual chromosomes or subsets of chromosomes. Turns out that was happening because I made all the text-based columns categoricals and subsetting on just the selected chromosomes was working but when I applied groupby
it was using the original categoricals as the grouping factors and causing errors when I tried to get last row from empty dataframes. The solution to allow the groupby
to work with categorical to only result in groups for those from groups with data was to add observed=True
setting to the groupby
step. However, they working that out I realized by default most text-based columns just get assigned as object
dtype. Thinking that would be better than involving categoricals, I changed all dtypes to that and that enabled me to leave out observed=True
setting to the groupby
step. However, the order of the chromosomes along the x-axis that resulted was atrocious. Things were all over the place, like in the first image I posted in this reply. Fortunately, knowing I could deal with the chromosome names as categoricals by usingobserved=True
setting for the groupby
step to make plotting subsets of chromosomes possible again, I then tried just setting the chromosome name column to a category dtype and leaving all other text-based columns as object
dtype. Amazingly, not only did things work again so that I could plot subsets of chromosomes, now the resulting x-axis order matched the original GTF. See example below:
There are still some minor label placement issues for the MT and higher-numbered chromosomes but that can be sorted out by saving as SVG and editing in vector graphics software. Alternatively, some of it might be solved by adding a way to make the plot larger. However, I think this issue is closed.
Thanks for your detailed answers. Chromosomes order is now fixed.
Thank you for trying it and letting me know. It really helped improve it.
I have made a launchable Jupyter session that will have examples using both the human and yeast GTFs. It can be found here with the static version here. That will make it easier for other folks to use it and for me to test the basics and advanced features more easily.
Unfortunately, in that environment it seems the human section seems to fail due to restrictive computational/memory capacity. (Or at least the human section commands did when I first tried.)
Hi,
I am trying to run the following:
I get:
Here is the beginning of Homo_sapiens.GRCh38.88.chr.gtf:
Here is the beginning of data.tsv, processed as here:
Also, running time is quite long..
Thank you